DOCUMENT RESUME



ED 206 633



TH BIO <I7« 



AUTHOR        Angoff, William H.; Schrader, William B.
TITLE         A Study of Alternative Methods for Equating Rights
              Scores to Formula Scores.
INSTITUTION   Educational Testing Service, Princeton, N.J.
REPORT NO     ETS-RB-81-8
PUB DATE      May 81
NOTE          167p.
EDRS PRICE    MF01/PC07 Plus Postage.
DESCRIPTORS   Academic Ability; College Entrance Examinations;
              *Equated Scores; Guessing (Tests); Higher Education;
              *Response Style (Tests); Science Tests; Scoring;
              *Scoring Formulas; Secondary Education; *Testing
              Problems; Verbal Tests
IDENTIFIERS   College Board Achievement Tests; Graduate Management
              Admission Test; Invariance Principle; Scholastic
              Aptitude Test



ABSTRACT

The purpose of this study was to determine whether it
would be possible to equate rights-scored to formula-scored tests
without causing a discontinuity in the meaning of the score scale.
Several other subsidiary studies--of the characteristics of the two
scoring methods, of nonresponse and guessing, and of reliability and
parallelism--were also undertaken. The study was conducted in two
phases: (1) of two forms of the verbal section of the Scholastic
Aptitude Test and one form of the College Board Chemistry Achievement
Test; and (2) of operational and experimental subtests of the
Graduate Management Admission Test. It was found that the data of
this study support the hypothesis that formula scores for tests
administered with rights directions are directly comparable to
formula scores for the same tests administered with formula
directions. Thus, the directions under which a test is administered
can be changed without serious concern that a discontinuity in the
score scale will result. (Author/GK)



* Reproductions supplied by EDRS are the best that can be made *
* from the original document.                                  *



A STUDY OF ALTERNATIVE METHODS FOR EQUATING 
RIGHTS SCORES TO FORMULA SCORES 



William H. Angoff
William B. Schrader



Educational Testing Service
May 1981



Copyright © 1981. Educational Testing Service. All rights reserved.


Abstract 

The principal purpose of this study was to determine whether it would
be possible to equate Rights-scored tests to Formula-scored tests without
causing a discontinuity in the meaning of the score scale. Several other
subsidiary studies--of the characteristics of the two scoring methods, of
nonresponse and guessing, and of reliability and parallelism--were also
undertaken. The study was conducted in two phases: 1) of two forms of the
SAT-verbal and one form of the College Board Chemistry Achievement Test,
based on data from a special experimental test administration; and 2) of
operational and experimental subtests of the Graduate Management Admission
Test, based on data from the (regular) October 1980 administration of the
GMAT to applicants for business school. Several outcomes of the study were
observed that will be useful for the understanding of some of the issues of
Rights and Formula scoring. In addition, it was found that the data of
this study support the hypothesis that Formula scores for tests administered
with Rights directions are directly comparable to Formula scores for the
same tests administered with Formula directions. Thus, the directions under
which a test is administered can be changed without serious concern that a
discontinuity in the score scale will result.



Acknowledgments



The authors gratefully acknowledge the contributions of Paul W.
Holland, Frederic M. Lord, and E. Elizabeth Stewart, who advised us
on the analysis and interpretation of the data; of Mark Barton, who
was responsible for the statistical calculations; and of Doris Conway,
who provided valuable assistance throughout the study.

William H. Angoff
William B. Schrader



Contents



Issues in Rights vs Formula Scoring

A Study of College Board SAT-verbal and Chemistry Tests, Based
on Special Test Administrations

    Questions Addressed by the Study

    Study Design

    Sample Characteristics

    Test Administration

    Results

        Effect of Directions on Mean Formula Scores: Method
        of Analysis

        Effect of Directions on Mean Formula Scores: Findings

        Effect of Directions on Nonresponse

        Effect of Directions on Guessing

        Effect of Directions and Scoring on Reliability and
        Parallelism

        Effect of Directions on Equating: Method of Analysis

        Effect of Directions on Equating: Findings

    Summary and Conclusions

A Study of the Graduate Management Admission Test, Based on Program
Data

    Questions Addressed by the Study

    Study Design

    Sample Characteristics

    Results

        Effect of Directions on Mean Formula Scores

        Effect of Directions on Nonresponse

        Effect of Directions on Guessing

        Effect of Directions and Scoring on Reliability
        and Parallelism

        Effect of Directions on Score Equating: Method
        of Analysis

        Effect of Directions on Score Equating: Findings

    Summary and Conclusions

Implications of Findings



ISSUES IN RIGHTS VS FORMULA SCORING 



As is well known, the controversy over the Formula-scoring vs
Rights-scoring issue has continued without much loss of force for more
than fifty years, and discursive articles and reports of empirical studies
appear in the literature on this and related subjects with no less
frequency today than in earlier years. The considerations that have
persuaded the writers on this topic to adopt one position or the other have
ranged widely from issues having to do with the reliability and validity of
the scores through the practicalities of administering a testing program
under conditions understandable and acceptable to the examinees, to issues
of special tactics, ethics, and morality in the taking of tests, differences
in personality among examinees and their effects on the (cognitive) test
scores, and considerations of equity and fairness to all examinees. Beyond
the matter of the particular choice of the administration-and-scoring system
is the further complexity of changing from one to the other without causing
a discontinuity in the meaning of the scale of scores, a scale that is
intended to have the same meaning independently of the form of the test
administered, the time and place of administration, or the nature of the
examinees tested.

In order that there be no doubt about what is intended in the use of
terms and expressions in discussing Formula-vs-Rights issues, it should be
understood that answer sheets completed and submitted by examinees are
appropriately scored Rights, a simple count of the number of items answered
correctly, if the examinees have been informed, prior to their taking the
test, that their answer sheets will be scored in this way. This information
would normally be accompanied by advice to the examinees that, since there
is no penalty for incorrect responses, they should not omit any items in
the test. Such words of advice, or directions, may be further described
as permissive instructions, in the sense that examinees are encouraged to
guess. In contrast to Rights scoring, Formula scoring procedures are those
in which an explicit penalty for incorrect responses is used, typically of
the form F (Formula) = R - W/(k - 1), where R and W are the counts of right
and wrong responses, respectively, and k is the number of choices per item.
The Formula score is not ordinarily intended as a punitive device. The intent
of the formula is to yield a net score of zero, on the average, for the
aggregate of all items that the examinees mark purely at random. It is
understood here too that prior to taking a Formula-scored test the examinee
will be informed that this will be the mode of scoring, and he or she will
be cautioned that pure guessing is risky and could lower the score. Often
added to these instructions is the advice that urges the examinees to
eliminate those options that they know are clearly incorrect, and to guess
from among the others. Such instructions and advice may be interpreted as
restrictive instructions, in the sense that examinees are discouraged from
guessing.
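
The two scoring rules defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample answer sheet are hypothetical and do not come from the report, and exact fractions are used simply to keep the arithmetic transparent. The last function checks the stated intent of the formula, namely that items marked purely at random contribute zero on the average.

```python
from fractions import Fraction


def rights_score(right: int) -> int:
    """Rights score: a simple count of the items answered correctly."""
    return right


def formula_score(right: int, wrong: int, choices: int) -> Fraction:
    """Formula score F = R - W/(k - 1); omitted items neither add nor subtract."""
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    return Fraction(right) - Fraction(wrong, choices - 1)


def expected_formula_score_random(items_marked: int, choices: int) -> Fraction:
    """Expected Formula score when `items_marked` items are marked purely at random."""
    expected_right = Fraction(items_marked, choices)   # chance of success is 1/k per item
    expected_wrong = items_marked - expected_right     # the remaining marked items
    return expected_right - expected_wrong / (choices - 1)


# Hypothetical answer sheet: 40 right, 15 wrong, 5 omitted, five-choice items.
print(rights_score(40))                        # 40
print(formula_score(40, 15, 5))                # 145/4, i.e. 36.25
print(expected_formula_score_random(20, 5))    # 0
```

Under these assumptions the penalty of 1/(k - 1) per wrong response is exactly what offsets the 1/k chance of a lucky hit, which is why the expected net score for random marking is zero.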

It has been pointed out (Diamond and Evans, 1973; Ebel, 1965; Stanley,
1954) that Rights scores and Formula scores are very highly correlated--when
all examinees answer all items, the correlation is perfect--and therefore,
it might be inferred, it matters very little whether the papers are scored
one way or the other. It is indeed true that, given a set of completed
answer sheets, the two types of scores are highly correlated. But, clearly,
this correlation is spurious; the two scores are based on the same set of
responses to the same set of items. What is different is simply the method
of dealing with those responses. Further, what is not recognized in making
this inference is that in any operational test the method of scoring must
go hand in hand with the instructions given to the examinees, and that the
test-makers are not at liberty, once they have given the examinees
instructions to guess, or not to guess, to score the answer sheets otherwise
(Cureton, 1966; Davis, 1967). It may be assumed--and there are sufficient
data to support this assumption--that examinees adopt a strategy for
guessing, or not guessing, taking the prior information and advice in good
faith and assuming that their papers will in fact be scored as they were led
to believe they would be. Therefore, the scoring for each individual tested
must be performed by the method specified in the directions under which the
test is administered.

Rights vs Formula: Pros and Cons

As has been implied at the beginning of this paper, the literature on
the issue of Formula scoring and Rights scoring is voluminous, and several
excellent reviews of the literature on the topic are available, including
extensive reviews by Abu-Sayf (1979) and Diamond and Evans (1973), and a
brief overview by Thorndike (1971, 59-61). Consequently, no effort will be
made in this paper to conduct an exhaustive review or to evaluate the present
state of opinion regarding the various questions regarding Rights and Formula
scoring. Instead, a brief summary is made of opinions and findings that
bear on the issues as they relate to the conduct of extensive testing
programs.

The principal arguments that have been advanced in support of
Rights-instructions-and-scoring and in opposition to
Formula-instructions-and-scoring are as follows:

1. Rights instructions and advice are much easier for examinees
to understand and to follow. Formula advice requires examinees to evaluate
their knowledge of the content of the item--e.g., to know when one or more
incorrect options can be eliminated--in deciding whether or not to guess.

2. Many examinees fail to understand the logic of the formula,
and experience some anxiety about the risks they are asked to take.

3. As a consequence of the foregoing, it is very difficult to
write directions to accompany a Formula-scored test that are short, clear,
and understandable.

4. It is probably more difficult to attain virtually error-free
scoring when Formula-scoring is used than when Rights-scoring is used.

5. Because of the considerations in (1), (2), and (4) above, some
investigators claim that a Formula-scored test introduces irrelevant sources
of error variance into the test, stemming from differences among examinees
with respect to: their ability to understand Formula-score directions and
take optimal action (Abu-Sayf, 1979); their levels of confidence in
assessing their own degree of knowledge (Slakter, 1968b); their willingness
to take a risk (Sherriffs and Boomer, 1954; Slakter, 1967, 1968a, 1968b,
1969; Votaw, 1936); and their general levels of confidence in the testing
situation.

6. Glass and Wiley (1964) have presented both theoretical and
empirical evidence in support of the position that Rights-scored tests are
more reliable than Formula-scored tests. On the other hand, Lord (1963,
1975) and Lord and Novick (1968, p. 308) argue effectively for the position
that Rights-scored tests are less reliable than Formula-scored tests.

7. Arguments and data have been introduced by several
investigators--e.g., Cross and Frary, 1977; Cureton, 1966; Rowley and Traub,
1977; and Slakter, 1968a, 1968b--in support of the assertion that the
Formula-score directions and advice given in connection with Formula-scored
tests are bad advice; examinees would be better advised to guess, even in a
Formula-scored test, since their scores would be higher than if they did not
guess. This would suggest that in general students know more about the
content of a test than they think they do. An implication of this position
is that students tested with Formula directions are put at a disadvantage
relative to those tested with Rights directions. (This assertion will be
referred to later in this paper as the Differential Effect Hypothesis, since
its claim is that Formula directions tend to reduce some examinees' scores
artificially.)

8. If the assertion in (7) is supported by data, and if it is also
true that low-scoring examinees are less inclined to guess than are
higher-scoring examinees, then it would follow that Formula directions and
Formula-scoring tend to depress low scores still further. It would therefore
also follow that Formula directions and Formula-scoring would be especially
disadvantageous to minority students, who earn lower scores, on the average,
on most tests. Ebel (1968) found in fact that low-scoring examinees are
slightly more inclined to guess than higher-scoring examinees. But
additional confirmatory data would be useful.



9. The Differential Effect Hypothesis also holds that students
who are less inclined to take a risk, and therefore less inclined to guess
at items they are not sure of, are further disadvantaged by the
Formula-directed test (Sherriffs and Boomer, 1954; Slakter, 1967, 1968a,
1968b, 1969; Votaw, 1936).

10. The argument has been advanced that Rights scores provide
discriminations below mean chance, a point which is defined as a zero raw
score, in Formula-score terms. Boldt (1968) and Levine and Lord (1959) have
in fact demonstrated that there is a valid discrimination provided by scores
below mean chance, although not as much as is provided elsewhere on the
score scale. In recognition of this fact, some testing programs report
scaled scores corresponding to negative Formula raw scores--i.e., Formula
scores below mean chance. The perhaps obvious point should be made here
that "mean chance" is the mean raw score one would earn if all items one
answered were answered at random. However, this is not to say that scores
at and below "mean chance" were in fact earned by responding at random.

On the other hand:

1. There have been objections to Rights (permissive) directions
on the ground that they encourage indiscriminate guessing, especially when
the examinee has insufficient time near the end of the test to consider the
remaining items carefully. The argument goes on to say that indiscriminate
guessing is educationally deplorable, because it focuses the student's
interest on improving his or her test score, without sufficient regard to
the educational outcome being assessed by the test. That indiscriminate
guessing occurs, especially toward the end of the test, cannot be denied.
Numerous instances have been observed of "pattern-marking" the last several
items in tests scored Rights, indicating clearly that random responding has
taken place.

2. Lord (1963) has advanced theoretical arguments that because
of the random guessing component, which adds error variance of its own,
Rights-scored tests are less valid than Formula-scored tests; and under the
assumption that omitted items would be replaced with random responses,
Rights-scored tests are also less efficient than Formula-scored tests (1974).
Lord also demonstrates (1977) that difficult tests are extremely unreliable
for low-scoring examinees because they guess more often than they should and,
according to Ebel (1968), more often than do other examinees. These tests,
Lord claims, would be more reliable for low-scoring examinees if the tests
were shortened by removing the more difficult items.

3. Further, there is some concern that data provided by items
placed near the end of the test will yield poor estimates of ability in a
permissive (Rights-directed) situation because some relatively able examinees
will mark the last few items (those that they do not have time to consider
more carefully) at random, thereby causing those items to appear to be more
difficult for examinees at that level of ability--and more error-laden--than
they actually are.

4. With regard to the assertion that examinees are better advised
to guess than to omit, on the basis that they generally know more than they
think they do, there is evidence to show that this is a more complex issue
than it may appear to be. While it is entirely possible that some examinees,
those who are less prone to take risks but who have partial information, may
improve their Formula scores by guessing, there are others, especially
lower-scoring students, who would do worse, because their information is
often misinformation. That this is true is shown by the preponderant
numbers of below-chance asymptotes sometimes observed in item response
curves (Lord, 1980, 110). It is also observed in the large numbers of
items with smaller-than-chance proportions of correct responses made by
low-ability examinees, as evidenced in typical item analysis outputs.
Finally, the data, referenced above, by Boldt (1968) and by Levine and
Lord (1959), demonstrate that below-chance scores are not merely random
departures from a chance mean; they are valid scores, earned, very likely,
as a result of misinformation. Students scoring at those levels would have
profited, even on a Formula-scored test, by guessing blindly or by omitting
items that were too difficult for them rather than by attempting to answer
them.

5. The advantages to the examinee of Rights directions are not
as clear as has been claimed. Even when instructed to guess when they are
in doubt, some examinees will fail to respond to every item. Whether these
examinees leave some items blank because they do not understand the
instructions, because they do not trust them and perceive a risk in
responding, or because they do not have enough time to respond is not known.
The fact remains, however, that Rights directions do not insure that
examinees will respond to every item.

6. It has been suggested (L. R. Tucker, personal communication)
that Formula-scoring as such may compensate for differences in instructions,
that even when different students (of equal ability) have been given
different instructions regarding guessing, Formula-scoring will tend to
equalize the scores. This is a major consideration. Since the directions
given to examinees are intended to influence guessing strategies, this
hypothesis further suggests that Formula-scoring may also tend to compensate
for individual differences among examinees with respect to their
guessing strategies. If this is so, it would argue that tests should be
Formula-scored; and consistent with that method of scoring, tests should
be administered under restrictive instructions with respect to guessing.
(This hypothesis will be referred to later in this paper as the Invariance
Hypothesis, since its claim is that Formula scores are invariant with respect
to directions for guessing.)

In recent months serious consideration has been given to the pros and
cons outlined above in evaluating the desirability of moving from the
Formula-directions-and-Formula-scoring mode, which characterizes most of the
large-scale testing programs administered by Educational Testing Service
today, to the Rights-directions-and-Rights-scoring mode. An important factor
in contemplating such a change, however desirable such a change may be from
the other points of view, is whether it can be effected without a
discontinuity in the score scale. For example, while it certainly is true
(as was discussed above) that for a given set of individuals to whom a test
has been given under one of the two types of directions, the correlation
between the two methods of scoring will be (spuriously) very high, it is also
true that the mean and standard deviation of these scores will be quite
different: Formula-scoring will inevitably show a lower mean and higher
standard deviation of raw scores than Rights scoring. However, a good deal
more than a simple shift in mean and standard deviation is involved here.
Since the mode of scoring cannot be arbitrarily chosen but must be consistent
with the particular types of directions used in the test administration
itself, we must necessarily deal with the issue as one involving a
conversion of a test administered with Rights directions and scored Rights
to the scale of a test administered with Formula directions and scored
Formula. Not only the scoring method but the strategic orientation of
the examinee is at issue; and it is the combined effects of the behavioral
and scoring components of the change that cause a shift in the scores and
a possible discontinuity in the scale. Such a discontinuity, if it occurred,
would mean that scores earned after the shift would not be strictly
comparable to those earned before the shift.
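
The purely arithmetic part of this point (very high correlation, lower mean, higher standard deviation) can be illustrated for the simplest case, in which every examinee answers every one of n items. Then W = n - R, so F = R - (n - R)/(k - 1) = [k/(k - 1)]R - n/(k - 1), a linear function of R with slope greater than one. The Python sketch below uses hypothetical Rights scores on an assumed 60-item, five-choice test; the data and function names are illustrative only, and the sketch says nothing about the behavioral component of a change in directions, which is the harder problem described above.

```python
from statistics import mean, pstdev


def formula_scores_all_answered(rights, n_items, choices):
    """Formula scores when every examinee answers every item (so W = n - R)."""
    return [r - (n_items - r) / (choices - 1) for r in rights]


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical Rights scores on a 60-item, five-choice test (illustrative only).
rights = [22, 31, 38, 45, 52]
formula = formula_scores_all_answered(rights, n_items=60, choices=5)

print("means:", mean(rights), mean(formula))          # Formula mean is the lower one
print("SDs:  ", pstdev(rights), pstdev(formula))      # Formula SD is larger by k/(k - 1)
print("r:    ", pearson(rights, formula))             # essentially 1 in this special case
```

With omitted items the relation is no longer exactly linear, so this special case should be read as an upper bound on how simple the conversion can be.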

The present study was undertaken, therefore, in an attempt to
investigate the psychometric feasibility of changing from the present
restrictive/Formula mode to the permissive/Rights mode without endangering
scale continuity. In the process of planning the study, care has been taken
in the design of special administrations, and analyses based on data
resulting from those administrations, to investigate various methods of
equating that could be realistically undertaken in the context of an
operational testing program, but also to collect data which are designed to
cast some light on one or more of the issues on which Formula-score and
Rights-score adherents are presently divided.

Technical Problems in Equating Rights Scores to Formula Scores

It is generally known that all of the large-scale testing programs
administered by Educational Testing Service report scores to examinees and
to test users on a continuing equated scale, a scale which, in nearly all
instances, is clearly different and distinguishable from the raw scores of
any particular test form. The purposes behind the development and
maintenance of this scale are clear. The most important of these purposes is
that the scores of students taking the tests at different times and at
different administrations are thus made directly comparable, in spite
of the inevitable variations in difficulty that exist from one form to
another; and no examinee is put at an advantage or disadvantage in relation
to other examinees because of the particular level of difficulty of the
test form that he or she happens to take. From the score user's point of
view this system also has distinct and important advantages: It offers
the freedom of using the scores from whatever test form may be conveniently
available and free of compromise. There is, additionally, the advantage
that scores can be compared across students and groups of students without
regard to the test forms that yielded those scores; and special studies
can be carried out, aggregating data across forms, plotting trends, and
studying the effects of intervening time and treatments on scores. These
freedoms are all made possible by a complex system of form-to-form equating
to the underlying scale, ordinarily carried out immediately following the
administration of each new form of the test in each of the large-scale
testing programs.

Although there are, as expected, variations and modifications in
particular details, one can categorize the equating methods used in the
large-scale programs as falling into one or the other of two types. In one
of these, the new form is "spiralled"* with one or more old forms, and
administered, one form to each examinee taking the test at the
administration in which the new form is first offered. Using statistics based
on the groups of examinees taking the two or more forms, the scores on
the new form are equated directly to the scores on the old form(s) in a
procedure based on the assumption that the separate groups are equivalent
in all respects, but in particular, with respect to the distribution of
the abilities in question.
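
As a rough illustration of what equating under the equivalent-groups assumption can look like, the Python sketch below uses linear equating, which places a new-form score on the old-form scale by matching means and standard deviations. This is an assumption made for illustration: the passage above does not say which equating function is used, and the score lists and the function name `linear_equating` are hypothetical.

```python
from statistics import mean, pstdev


def linear_equating(new_form_scores, old_form_scores):
    """Return a function mapping new-form raw scores onto the old-form scale.

    Assumes the two groups are equivalent in ability, so that differences in
    mean and standard deviation reflect differences between the forms only.
    """
    m_new, s_new = mean(new_form_scores), pstdev(new_form_scores)
    m_old, s_old = mean(old_form_scores), pstdev(old_form_scores)
    if s_new == 0:
        raise ValueError("new-form scores show no variation; cannot equate")
    slope = s_old / s_new

    def to_old_scale(x):
        # Equal standardized deviates: (y - m_old)/s_old = (x - m_new)/s_new
        return m_old + slope * (x - m_new)

    return to_old_scale


# Hypothetical raw scores from two spiralled (equivalent) groups.
new_form = [20, 25, 30, 35, 40]
old_form = [24, 28, 32, 36, 40]

convert = linear_equating(new_form, old_form)
print(convert(30))   # the new-form mean maps onto the old-form mean
print(convert(40))
```

If the groups are not equivalent, the differences in mean and standard deviation confound form difficulty with group ability, which is why a different design is needed when forms cannot be spiralled.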

The foregoing method of equating is feasible, however, only in those
programs in which more than one test form may be administered at one time.
In other testing programs it is customary for all examinees taking the test
at a given administration to take precisely the same operational test form.
As a result--and since examinees taking the test at different administrations
are known to differ--it is not possible to identify equivalent groups of
students to use for equating. As an alternative device, the method of
equating used in these circumstances employs a set of "common items." These
are items that were administered to the examinees when the old form was
first given and are given again to other examinees at the time they take
the new form. The sense of the "commonality" of these items goes beyond
the printed test, however; it also implies, and requires, that the
conditions of measurement--the psychological task represented by the
items--be the same for both groups of students, because it is on the basis
of their observed performance on these items that statistical adjustments are
made to compensate for the fact that the groups may not be equivalent. It
is the conditions of measurement, clearly, in addition to the content of
the items themselves, that also account for performance. The intent of
this method of equating is to simulate by statistical adjustments the
random (actually, systematic) sampling method described above.



*"Spiralling" is a term employed at ETS to describe a method of
distributing test forms to obtain systematic samples of examinees. When there
are two forms to be administered, the forms (say Forms X and Y) are packaged
in alternating sequence--X, Y, X, Y, X, ...--and distributed to successive
students as they are removed from the top of the package of test books. When
there are three forms (X, Y, and Z), they are packaged X, Y, Z, X, Y, Z, X,
Y, ... and similarly distributed. When the total group of individuals to be
tested is of size N and there are m tests, or test forms, to be spiralled,
there will be N/m complete spirals, or cycles, of test forms to be
distributed, and the nth individual in every complete cycle will receive the
same test form. Thus, if, for example, seven forms are spiralled, the 3rd,
10th, 17th, 24th, 31st, ... individuals in the group will receive the same
form of the test. When the test books are separated by form, the samples of
individuals, each receiving a particular test form, are (except in highly
unusual circumstances) essentially stratified samples and more nearly
equivalent than if random sampling methods were employed.
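
The spiralling scheme described in the footnote on this method amounts to assigning forms by position in the seating order, cycling through the m forms. A minimal Python sketch, with hypothetical function names, reproduces the footnote's example of seven spiralled forms.

```python
def spiral_assign(n_examinees, forms):
    """Assign forms in repeating package order: X, Y, Z, X, Y, Z, ..."""
    return [forms[i % len(forms)] for i in range(n_examinees)]


def positions_receiving(form_position, n_forms, n_examinees):
    """1-based positions of examinees who receive the form in slot `form_position`."""
    return list(range(form_position, n_examinees + 1, n_forms))


# Two forms packaged in alternating sequence.
print(spiral_assign(5, ["X", "Y"]))       # ['X', 'Y', 'X', 'Y', 'X']

# Seven forms spiralled: who receives the third form among the first 35 examinees?
print(positions_receiving(3, 7, 35))      # [3, 10, 17, 24, 31]
```

Because every form is handed to every m-th examinee, each form's sample is spread evenly across testing rooms and seating order, which is the sense in which the resulting groups are treated as equivalent.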

Both of these general methods of equating can operate, and have
operated, quite well in the context of secure testing programs, in which
it has been possible to protect old forms from compromise so that they can
be used again without giving the new groups of students special advantages.
In the context of an open-disclosure environment, as has been enacted in
New York State, however, severe constraints are imposed on the methods of
equating. The first method of equating, which depends on the spiralled
administration of the new form along with one or more old forms, is no
longer possible, since the old forms would not be secure. The alternative
method, which involves a set of "common items," is possible only when the
"common items" are nonoperational, that is, do not count toward the
examinee's score. That method, it should be noted, is feasible only within
the latitude permitted by the present New York State law, since nonoperational
items are protected from disclosure by the New York law; laws presently
being considered in other jurisdictions may not permit this latitude.

The "common item" method has worked quite satisfactorily in the
past when the content of the test and the conditions of administration
can be adequately represented in miniature in the set of common items.
But when, as would normally be the case in considering a shift from
Formula to Rights scoring, the old form and the common items are
administered with restrictive instructions and scored by Formula, and the new
form and the common items are administered with permissive instructions
and scored Rights, and, finally, when the groups taking the forms in these
two administrative modes are not in any sense randomly equivalent groups, the
usual methods for equating are inapplicable. The scores on the common items
do not have the same meaning in the two contexts. What remains is the
possibility that the Invariance Hypothesis, defined above, can be utilized in
the equating process. This hypothesis, it is recalled, states that
differences in the scores earned by two randomly selected groups who have
been given different instructions to guess tend to be minimal when the test
papers for the two groups are scored by Formula. The principal purpose of the
present study is to test this hypothesis. If the data support it, then the
sets of common items for the examinees taking the new form--those receiving
Rights directions--can be rescored by Formula, allowing direct comparisons
between the two groups to be made on the common items in Formula-score terms
in the process of equating the new and old forms. Even if the hypothesis is
not fully supported, it is possible that the data of the study will provide
information to aid in developing appropriate adjustments to overcome the
remaining bias.

Still another possibility remains open, although it too represents
some risks. If it can be shown that examinees can shift, when they are
instructed to do so, from one set of test-taking strategies to another,
then the new forms to be administered in the future may be administered
under Rights directions and scored Rights, and the sets of common items
(if they appear as a separately-timed block) administered under Formula
directions and scored by Formula. Such an administration would enable
the direct comparison of Formula scores on the equating sections. The
risk here is that examinees may be able to identify the nonoperational
section because of its different instructions, and perform on it at a
reduced level of motivation, attention, and care.

An additional concern, alluded to earlier, in the equating of
Rights-administered-and-scored tests to Formula-administered-and-scored
tests is the possibility of introducing an additional dimension into the
measurement, which may effectively result in the "equating" of nonparallel
tests. That is, if the data suggest that the tests represented
substantially different psychological tasks, then there will be some
considerable question as to the generality of the meaning of the
"equating," however it is carried out.




A STUDY OF COLLEGE BOARD SAT-VERBAL AND CHEMISTRY TESTS
BASED ON SPECIAL TEST ADMINISTRATIONS



Questions Addressed by the Study

The principal purpose for which this study was undertaken was to
investigate the effectiveness of several methods of equating scores that
had been earned under conditions of Rights directions and scoring to scores
earned under conditions of Formula directions and scoring. Information
on this subject is of vital importance if an operational testing program
that has administered and scored its tests in the Formula mode is to be
capable of shifting to Rights directions and scoring without introducing
a discontinuity in its score scales.



In the course of studying the equating methods it was deemed
necessary to investigate other related questions:

1. To what extent do the results provide a firm basis for
choosing between the Invariance Hypothesis and the Differential Effect
Hypothesis?

The Invariance Hypothesis and the Differential Effect Hypothesis
differ essentially in their predictions regarding how well students would
perform if, instead of choosing to omit certain items when tested under
Formula directions, they chose to answer them. The Invariance Hypothesis
implies that their performance on the omitted items would be, on the
average, neither better nor worse than would be expected by chance. The
Differential Effect Hypothesis, on the other hand, implies that their
performance on those items would be better, on the average, than would
be expected by chance. If the Invariance Hypothesis is true, Formula
scores would remain the same, on the average, whether or not the students
chose to omit items about which they had insufficient basis for answering.
If, however, the Differential Effect Hypothesis is true, students who
choose to omit certain items when tested under Formula directions would
be at a disadvantage in comparison with other students of equal ability
who answered all the items.

Although the same student cannot take the same test under both
Rights and Formula directions at the same time, it is possible to
administer the same test so that one random half of a large group is tested
with Rights directions and the other half is tested with Formula
directions. The Invariance Hypothesis would predict that the two groups
would have virtually equal mean Formula scores; the Differential Effect
Hypothesis would predict that the group tested under Rights directions
would have a higher mean Formula score than the group tested under Formula
directions.

2. To what extent do Formula directions affect the number of
items considered but intentionally omitted, the number of items not
reached, and the total number of items not attempted?

3. To what extent do students comply with the instructions
given to them and change their strategies with respect to guessing in a
manner consistent with those instructions?

4. When students are stratified on the basis of ability, is
there a discernible difference between high- and low-ability students in
the effect of Formula and Rights directions on the average number of
items omitted, not reached, or not attempted? Do Black students show
the same results as the total group?

5. Does a guessing index defined as "Wrongs minus Omits" provide
useful information about guessing tendencies that is not provided by the
various indices of nonresponse?

6. To what extent do Formula and Rights directions yield
different reliabilities, as determined by internal consistency and
parallel-form methods?

7. Is there reason to believe that the assumption of parallelism
between a test administered with Rights directions and the same test
administered with Formula directions is not warranted?

8. How much confidence can be placed in the Invariance Hypothesis
as a basis for equating Rights scores to Formula scores? To what extent
does the use of the Invariance Hypothesis result in systematic differences
between conversion lines obtained by assuming invariance and corresponding
parameters obtained by traditional equating methods?
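
Questions 2, 4, and 5 rest on a few counts taken from each answer sheet. The sketch below shows one way those counts might be obtained. It assumes a convention that is not stated in the questions above: blanks after the last marked item are treated as "not reached," and blanks before it as "omitted." The function name and the coding of responses are hypothetical.

```python
def nonresponse_indices(responses, key):
    """Count rights, wrongs, omits, and not-reached items for one answer sheet.

    responses: list of marked choices, with None for an unmarked item.
    key: list of correct choices, same length as responses.
    Assumption (not from the report): blanks after the last marked item
    are "not reached"; blanks before it are "omitted".
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    marked = [i for i, r in enumerate(responses) if r is not None]
    last = marked[-1] if marked else -1
    rights = sum(1 for i in marked if responses[i] == key[i])
    wrongs = len(marked) - rights
    omitted = sum(1 for r in responses[:last + 1] if r is None)
    not_reached = len(responses) - (last + 1)
    return {
        "rights": rights,
        "wrongs": wrongs,
        "omitted": omitted,
        "not_reached": not_reached,
        "not_attempted": omitted + not_reached,
        "guessing_index": wrongs - omitted,  # "Wrongs minus Omits"
    }


# Hypothetical eight-item answer sheet and key.
sheet = ["A", None, "C", "D", None, "B", None, None]
key = ["A", "B", "C", "A", "E", "B", "D", "C"]
print(nonresponse_indices(sheet, key))
```

On this made-up sheet the two blanks before the last marked item count as omitted and the two trailing blanks as not reached, so the guessing index is one wrong minus two omits.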







Study Design 

Several of the previous studies of Rights and Formula scoring (for
example, Cross and Frary, 1977; Sherriffs and Boomer, 1954; Slakter, 1968;
and Votaw, 1936) have called for the administration of a test under
Formula-score directions followed either immediately, or after some
intervening time, by a redistribution of the original answer sheets with
instructions to review all previously unanswered items and to fill them in,
using a differently colored pencil, with a considered or guessed response
(Rights-score directions). In no study that we know of was the order of
administration counterbalanced, to determine whether these instructions
were subject to an order effect. In all these studies the students were,
obviously, given additional time to reconsider their previously omitted
responses, more time, in aggregate, than a normally administered test with
Formula-score directions would have called for. This is a condition of
the studies that, by itself, would only have had an artificially elevating
effect on their scores. And, finally, as Lord (1975) points out, the
Slakter (1968a) study in particular is flawed because the students were
allowed several days before the redistribution of answer sheets, during
which time they were at liberty to compare notes with one another or to
check on doubtful items.

Unlike the foregoing studies, the present study was designed to
achieve symmetry, but did not require the examinees to review and respond
to a test they had previously taken. It is indeed the only study we know
of in which the same test was administered under both Rights and Formula
directions. Moreover, the administration of the tests was so arranged
that all comparisons would be made between and among experimentally
equivalent groups. Most of the analysis was based on special administrations
of a form of the College Board SAT-verbal Test that had first been
introduced in April 1978, to be referred to in this paper as Form A. Like
other current forms of the SAT, Form A is administered in two
separately-timed half-hour sections, 45 items in the first section and 40
items in the second section. Both sections contain items of four types:
antonyms, analogies, sentence completion, and reading comprehension.
Current operational practice is to administer the SAT-verbal with
restrictive instructions and to score it by Formula, R - (1/4)W, since all
items are five-choice. Four sets of directions for administration were
prepared for the present study, identical in all respects except for
instructions regarding guessing. The following table describes how the four
sets of instructions for Form A were administered. Again, it is recalled
that Rights directions are permissive with respect to guessing; Formula
directions are restrictive with respect to guessing.

Directions for Administering SAT-verbal, Form A

Set    Part (Section) 1    Part (Section) 2

1      Rights              Rights
2      Rights              Formula
3      Formula             Rights
4      Formula             Formula

Additional, confirmatory analyses were based on the administration of a
second form of the SAT-verbal, Form B, a form which was first introduced
operationally in June 1976. As just suggested, the analyses based on the
administration of Form B were not intended to be as detailed as those based
on Form A. They were undertaken to provide assurance that the results of
the main analyses were not idiosyncratic to Form A. Form B was constructed
to parallel Form A in content, item type, number of items, difficulty,
discrimination, and speededness. Two sets of instructions were prepared
for the administration of Form B, identical in all respects to those in
Set 1 and Set 4, above. The following table describes the instructions
for the administration of Form B.

Directions for Administering SAT-verbal, Form B

Set    Part (Section) 1    Part (Section) 2

5      Rights              Rights
6      Formula             Formula

Special test booklets were prepared for each of the six sets described
above. Four of the sets of test books contained Form A items, the other
two contained Form B items. Each set contained Rights and Formula
instructions for Part 1 and/or Part 2, as described above. Inasmuch as the
timing for all of the six sets was identical, it was possible to administer
all six to different students in the same testing room at the same time. It
also permitted "spiralling" the test books in the order: 1, 2, 3, 4, 5, 6,
1, 2, 3, 4, ... and the distribution of the test books to the students
in that order, with the result that every sixth student received the same
test book. By means of this procedure of (systematic) sampling, it was
possible to achieve very nearly equivalent groups of examinees taking the
different combinations of test and instructions. In fact, the groups formed
with this method of sampling were more nearly equivalent than would have
been obtained with random sampling methods.
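
The spiralled distribution described above can be sketched in a few lines. This is an illustration only: the student labels and the function name are hypothetical, and the text does not say whether the 1-to-6 sequence was restarted at each school, so the sketch simply shows one room of 20 students.

```python
from collections import Counter
from itertools import cycle


def spiral(students, book_sets=(1, 2, 3, 4, 5, 6)):
    """Hand out test books in a fixed repeating order (1, 2, ..., 6, 1, 2, ...)."""
    return list(zip(students, cycle(book_sets)))


# Hypothetical testing room with 20 students in seating order.
students = [f"student_{i:02d}" for i in range(1, 21)]
assignment = spiral(students)

# Every sixth student receives the same book, so the six groups drawn
# from one room can differ in size by at most one student.
counts = Counter(book for _, book in assignment)
print(counts)
```

Because each room contributes almost equally to every group, differences between schools are spread evenly across the groups, which is why the spiralled groups are expected to be more alike than simple random samples would be.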

The sample of students chosen for the administration of the SAT-verbal
was drawn from a population of students who were likely to be taking the SAT
for admission to college. (The specifics of the definition of that
population and the selection of schools are described in the following
section.)

As a further check on the results of the main analysis, based on Form A
of the SAT-verbal, the Chemistry Achievement Test of the College Board
Admissions Testing Program was also administered, but to an entirely
different sample of students, a sample drawn from students taking first-year
chemistry in high schools that have relatively large numbers of students who
take the College Board Achievement Tests. The form of the Chemistry Test
used in this study was one first introduced in January 1969. Although not as
new as the SAT-verbal forms referred to above, this form of the Chemistry
Test was re-examined and judged suitable for the experiment as well as for
current operational use.

Unlike the SAT, the Chemistry Test is administered under a single
one-hour time limit. But for that difference, the test books for Chemistry
were prepared in the same way as were the test books for SAT-verbal, Form
B. Two types of books were prepared, containing identical items, but
differing with respect to instructions, as shown in the table below.

Directions for Administering the Chemistry Test

Set    Directions

7      Rights
8      Formula




In those schools chosen, and agreeing, to take the Chemistry Test, test 
books 7 and 8 were spiralled, so that every odd student took one and 
every even student took the other. 

In preparation for the later analyses, eight groups of experimental
subjects were formed, each corresponding to one of the eight sets described
above, and designated accordingly.

The task of the test supervisor in both types of administrations was
limited to the presentation of the following information and instructions:
informing the students regarding the fact that they would have different
types of instructions to guess; instructing them with regard to the
required procedures for identifying themselves and for marking the answer
spaces; asking for the identification of sex, the identification of ethnic
group (American Indian, Asian American, Black/Afro American, Caucasian,
Chicano/Mexican American, Puerto Rican/Puerto Rican American, Spanish
American, or Other); asking for the students' rank in class (to the
nearest fifth); and, finally, timing the test.

The students participating in the study were also asked whether they
wanted their scores to be sent to them and their high school. If they
did, their names and their scores on the SAT-verbal (or Chemistry) scaled
score scale were reported to their high school, with instructions to the
high school to transmit the scores to the students.







Sample Characteristics 

The sample design called for testing relatively large numbers of
students for whom the test would be appropriate and who were willing to
participate. It also called for including a relatively large proportion
of minority group members, particularly Black students, in the SAT-verbal
sample. Because the SAT-verbal sample was divided into six subgroups, a
target figure of 9,000 students, including 2,000 minority students, was
used in planning the SAT-verbal sample. For Chemistry, the target figure
for sample size was 2,000. Because the study was concerned with equating
and subgroup comparisons, which do not require a typical cross-section of
examinees, the sample design, as described, was considered to be appropriate.

The selection of schools for the SAT-verbal sample took account of
relevant data about schools available in the College Board statistical data
file. The selection process used a listing of all schools having 50 or more
College Board candidates in 1978-79 and having a school code ending with
either of two (randomly selected) last digits. For each school having 100
or more candidates, percent minority and percent Black were also listed.

The first step in designing the sample called for estimating the
number of prospective examinees in the group of schools to be invited.
Because of limitations in available data on schools, because of uncertainty
regarding the proportion of schools and examinees that would choose to
participate, and because the time schedule was too tight to permit much
replacement of schools that decided not to participate, it was decided to
print substantially more test books than the minimum provided for in the
study design. Thus, if the participation rate turned out to be unexpectedly
high, sufficient test books would be available to permit all schools to test.
For SAT-verbal, 15,000 test books were ordered and for Chemistry 3,600 test
books were ordered. Planning for SAT-verbal was based on the number of
SAT-takers in 1978-79. In selecting the schools to be invited, we felt
that we could safely invite schools having 17,000 SAT-takers and still be
able to have a 10% overage in shipments to schools. For Chemistry, it was
estimated that about one-third of the 11th grade students in a school would
be taking that subject. Accordingly, it was decided to invite schools
having a total estimated enrollment in the 11th grade of about 10,800.

In designing the SAT-verbal sample, special steps were taken to insure
that a sufficient number of minority group members would be included to
provide an adequate sample for separate studies. An exploratory survey of
the data available on the school lists for SAT-verbal indicated that this
objective could be achieved if approximately half of the prospective
examinees were enrolled in the schools having the highest percentage of
Black students among their College Board test takers. The other half of the
group of prospective examinees could then be obtained from the other schools
on the lists. Accordingly, the initial sample included 49 schools having 17%
or more Black students among their College Board test-takers and 60 schools
selected at random from the remaining schools on the lists. On the basis
of the 1978-79 school data, it was estimated that about 24% of the
prospective examinees in the SAT-verbal sample would be minority group
students and about 18% would be Black students.

The remainder of the SAT sample was selected by random sampling from
the total list of schools, excluding Alaska, Hawaii, and the 49 schools
already selected. There were 635 schools in the eligible group. It was
decided to include 60 schools, enrolling an estimated 8,631 SAT-takers,
in the list to be invited. Thus, the total SAT sample included 109
schools and approximately 17,000 students.

The Chemistry sample was selected from a list of 61 schools having
a school code with the same (randomly selected) last digit and having 25
or more Achievement Test-takers in 1978-79. Enrollment estimates for 11th
grade students were available for 58 schools from the current Preliminary
Scholastic Aptitude Test data files. It was decided to impose the further
requirement that the number of Achievement Test-takers (as seniors) should
be at least 15% of the 11th grade enrollment. As it turned out, the 33
schools meeting these requirements had an estimated total 11th grade
enrollment of 10,541. As a result, these 33 schools were selected for the
Chemistry sample.

On the basis of a preliminary survey of the returns from the initial
mailing, it was decided to augment the sample by inviting additional schools
to participate. Because of the tight schedule, these schools were selected
only from among those located in New Jersey or in nearby states. Of the 50
supplementary schools for the SAT-verbal sample, 15 schools having a
percentage of Black students of 7.0 or higher were selected using the lists
for both SAT-verbal and Chemistry. The remaining schools were selected at
random from the two school lists prepared for the SAT-verbal sampling. The
20 supplementary schools for Chemistry were selected from schools on one of
the SAT-verbal lists that had not been selected for that study. The
supplementary sample was selected at random from schools having 25 or
more Achievement Test-takers.

Of the 109 schools included in the SAT-verbal sample, 52 provided
usable data for the study. The supplementary sample provided data for
17 schools. For Chemistry, 19 of the 33 schools included in the initial
sample participated, and nine of the schools in the supplementary sample
participated.

Description of the Samples 

The 69 schools in the SAT-verbal sample are located in 19 states.
New York was represented by 14 schools, and California and Pennsylvania
were each represented by nine schools. The 28 schools in the Chemistry
sample were located in 12 states and the District of Columbia. Six of the
28 schools were in Massachusetts, with five in New Jersey and four in
Pennsylvania.

Characteristics of the students included in the SAT-verbal and
Chemistry samples are shown in Table 1. Of the 6,260 students in the
SAT-verbal sample, 1,172 belonged to the Black/Afro-American group and 257
were members of the three Hispanic subgroups. Although the sample size both
for the total group and for minority group members was smaller than had
been planned, the groups were sufficiently large to provide a useful data
base for the study. A large percentage (58%) of the participants were
female students, and 92% reported that they were in the upper three-fifths
of their classes academically.

Table 1

Distributions of Ethnic Group Membership, Sex, and
Rank in Class in SAT-verbal and Chemistry Samples

                                        SAT-verbal      Chemistry
Characteristic                             N       %       N       %

Ethnic Group Membership
  American Indian                         35     0.6       9     0.4
  Asian American                         121     2.0      65     2.9
  Black/Afro American                   1172    19.2      36     1.6
  Caucasian                             4291    70.2    2009    90.9
  Chicano/Mexican American               141     2.3       4     0.2
  Puerto Rican/Puerto Rican American      61     1.0       7     0.3
  Spanish American                        55     0.9      11     0.5
  Other                                  230     3.8      68     3.1
  Missing Data                           154              97
  Total                                 6260   100.0    2306    99.9

Sex
  Female                                3624    58.0    1084    47.1
  Male                                  2623    42.0    1216    52.9
  Missing Data                            13               6
  Total                                 6260   100.0    2306   100.0

Rank in Class
  High Fifth                            2004    32.5     831    36.6
  Second Fifth                          1939    31.5     810    35.7
  Third Fifth                           1724    28.0     514    22.7
  Fourth Fifth                           375     6.1      76     3.4
  Low Fifth                              118     1.9      37     1.6
  Missing Data                           100              38
  Total                                 6260   100.0    2306   100.0

Approximately half (595) of the 1,172 Black students in the total
SAT-verbal sample were enrolled in six schools, each of which enrolled 50 or
more members of the Black student sample. Another one-third (391) were
enrolled in 12 schools that had from 20 to 48 sample members. The
remaining 187 students were enrolled in 34 schools. Twenty-seven schools did
not test any Black students for the study.

In the Chemistry sample, about 91% of the tested group who reported
ethnic group membership were White. About 53% were male, presumably
reflecting a greater tendency for males than for females to enroll in
chemistry courses. For Chemistry, over 72 percent were in the top
two-fifths in self-reported Rank in Class, and only 5% were in the bottom
two-fifths.
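
The phrase "of the tested group who reported ethnic group membership" suggests that the percentages in Table 1 are taken on the students who supplied the information, with the Missing Data row left out of the base. The short check below, using counts from Table 1, is offered only as a sketch of that reading; the function name is hypothetical.

```python
def percent_of_reported(count, total, missing):
    """Percentage based on the students who reported the characteristic."""
    return round(100 * count / (total - missing), 1)


# Counts taken from Table 1.
print(percent_of_reported(1172, 6260, 154))  # Black/Afro American, SAT-verbal
print(percent_of_reported(2009, 2306, 97))   # Caucasian, Chemistry
```

Both results agree with the tabled values of 19.2 and 90.9, which is consistent with the reading that nonrespondents were excluded from the percentage base.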






Test Administration 



In preparing the test booklets, particular attention was given to
the directions with respect to guessing. Because spiralling within school
was considered essential to making subgroups receiving different directions
as comparable as possible, the supervisor's instructions had to be
appropriate for both kinds of directions. At the beginning of the SAT-verbal
testing, the supervisor read the following statement:

You are about to take part in an experiment
concerned with the College Board Scholastic
Aptitude Test being conducted by the Educational
Testing Service, the organization that constructs
the College Board SAT and Achievement Tests. The
experiment, which will be extremely important to
students taking the tests in future years, is
being done in order to learn more about the effect
of test directions on your test performance.

The statement for Chemistry examinees was the same except for the
name of the test.

Just before the SAT-verbal examinees began work on the first section, 
the supervisor read the following statement: 

This test includes two separately-timed one-half
hour sections. Each section has special directions
concerning guessing. Some of you will have the same
directions concerning guessing for both sections; others
will have different directions for the two sections.
Please read the directions for each section on your
test booklet carefully, and answer the questions in
each section according to the directions for that
section.

At the beginning of the second separately-timed section, the supervisor
again instructed the students to read the directions for the section
carefully, and allowed time for them to do so. Directions concerning
Formula and Rights scoring were printed on a separate page from other
directions for the tests in order to emphasize their importance.

*In this discussion separately-timed parts of SAT-verbal are
referred to as sections.

The corresponding statement for the Chemistry examinees, given by
the supervisor as soon as the test booklets were distributed, was as
follows:

This is a 90-item, one-hour test. Please read 
the directions in your test book carefully, and 
answer the questions according to the directions. 

The Chemistry examinees were instructed to read the directions about
guessing as soon as they opened their test booklets; the SAT-verbal
examinees were instructed to read the directions about guessing just before
they began work on each section.

The Rights directions for SAT-verbal were adapted from the Rights
directions used in the Law School Admission Test. They were as follows:

Read the directions below carefully, and answer the 
questions in this section according to these directions. 

Your score on this section will be based on the number 
of questions you answer correctly. No deduction will be 
made for wrong answers. You are advised to use your time 
effectively and to mark the best answer you can to every 
question, regardless of how sure you are of the answer you 
mark. 

The Formula directions were essentially the directions used for
operational administrations of SAT-verbal, as follows:

Read the directions below carefully, and answer the
questions in this section according to these directions.

Students often ask whether they should guess when they
are uncertain about the answer to a question. Your
score on this section will be based on the number of
questions you answer correctly minus a fraction of
the number you answer incorrectly. Therefore, it is
improbable that random or haphazard guessing will
change your scores significantly. If you have some
knowledge of a question, you may be able to eliminate
one or more of the answer choices as wrong. It
is generally to your advantage to answer such
questions even though you must guess which of the
remaining choices is correct. Remember, however,
not to spend too much time on any one question.

Do not worry if you are unable to finish this
section or if there are some questions you cannot
answer; many students leave questions unanswered.
You should work as rapidly as you can without
sacrificing accuracy. Do not waste time puzzling
over a question which seems too difficult for you.

The special directions for the Chemistry examinees were precisely
the same except that the word "test" was substituted for the word "section."

Supervisor's manuals were prepared for SAT-verbal and for Chemistry.
These manuals were adapted from the manuals used with regular College
Board Tests. As suggested by the ETS Board of Prior Review, students were
informed by the supervisor at the beginning of the testing session that
their participation in the testing and in answering the questions was
strictly voluntary and that each student's scores would be reported only
to his or her school and to the student and would be reported only if the
student requested it by marking an appropriate space on the answer sheet.
Students were informed that the experiment was being done "in order to
learn more about the effect of test directions on your test performance."

All schools were asked to administer the tests between April 15 and
April 30, 1980. For SAT-verbal, "juniors who are planning to take the SAT"
were defined as the group to be tested; for Chemistry, "students who are
currently enrolled in the second semester of a one-year course in Chemistry"
were defined as the appropriate group. Within these general guidelines,
each school devised its own procedures for inviting appropriate students
to participate and for scheduling and administering the tests.

Except for the analyses of equating methods, analyses of Formula
score data were carried out using exact (i.e., unrounded) Formula scores.
Because the equating portion of the study was developed to guide
operational practice, the analyses of equating methods made use of rounded
Formula scores, in which exact scores ending in .5 were uniformly
rounded upward, as they are in operational practice at ETS.
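
The distinction between exact and rounded Formula scores can be illustrated as follows. This is a minimal sketch, not the operational ETS procedure: the function names are hypothetical, and it assumes five-choice items, so that the exact score is R - W/4 and can end only in .00, .25, .50, or .75. Exact fractions are used to keep the .5 cases free of floating-point error.

```python
from fractions import Fraction
import math


def formula_score(rights, wrongs, choices=5):
    """Exact Formula score R - W/(k - 1); k = 5 gives R - W/4."""
    return Fraction(rights) - Fraction(wrongs, choices - 1)


def round_half_up(score):
    """Round to the nearest integer, sending exact .5 cases upward."""
    return math.floor(score + Fraction(1, 2))


exact = formula_score(21, 18)  # 21 - 18/4 = 16.5
print(float(exact), round_half_up(exact), round(exact))  # 16.5 17 16
```

Python's built-in `round` sends ties to the nearest even integer, so it would give 16 here; the rule described in the text requires 17, which is why the sketch rounds with `floor(score + 1/2)` instead.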



Results

Effect of Directions on Mean Formula Scores: Method of Analysis

One set of analyses was designed to test the hypothesis that the
mean score for students examined under Rights directions will be equal
to the mean score for students examined under Formula directions when
Formula scoring is used for both groups of students. This hypothesis,
it is recalled, is referred to as the Invariance Hypothesis. If, as
Slakter and others have maintained, students tend to omit questions about
which they have useful partial knowledge, we would expect that the
Invariance Hypothesis would be rejected, and that the mean Formula scores
for students tested under Rights directions would be significantly higher
than the mean Formula scores for students tested under Formula directions.
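
The arithmetic behind the two predictions can be made explicit. On a five-choice item scored by Formula, a right answer adds one point and a wrong answer subtracts one-fourth of a point. The sketch below computes the expected change in Formula score when a student answers an item that would otherwise have been omitted. The function name is hypothetical, and the probability of one-third is an illustrative assumption, not a figure from the study.

```python
from fractions import Fraction


def expected_gain_per_item(p_correct, choices=5):
    """Expected change in Formula score from answering one omitted item.

    A right answer adds 1; a wrong answer subtracts 1/(choices - 1).
    """
    p = Fraction(p_correct)
    penalty = Fraction(1, choices - 1)
    return p - (1 - p) * penalty


# Blind guessing among five choices: the case the Invariance Hypothesis assumes.
chance = expected_gain_per_item(Fraction(1, 5))

# Hypothetical partial knowledge: the student is right one time in three.
partial = expected_gain_per_item(Fraction(1, 3))

print(chance, partial)  # 0 1/6
```

At chance level the expected gain is exactly zero, so mean Formula scores should not depend on the directions given. Any success rate above one in five yields a positive expected gain, which is the pattern the Differential Effect Hypothesis predicts for the group tested under Rights directions.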

The study design called for giving one group of students a test with
Formula directions and giving a comparable group of students the same test
with Rights directions. The experiment was designed to make the groups
receiving different directions as similar as possible in all other respects.
The use of the method of spiralling, described earlier in this report,
when applied to large groups, tends to produce groups that are quite
similar, not only in the abilities measured by the tests but in other
respects as well. Moreover, this method, as used in this study, tends to
insure that, within each participating school, the number of students
receiving each type of directions will be very nearly equal.

In planning the data analysis, it was recognized that the use of a
simple t-test of means would not take account of the fact that examinees
were assigned to samples by spiralling rather than by simple random
sampling. Under the conditions of this study, spiralling insured that
each participating school would be represented by approximately the
same number of students in each sample. Because schools differ
appreciably in the ability level of their students, the spiralling method
would be expected to yield a smaller variability across samples than
would simple random sampling. Thus, a simple t-test would be based on
an underestimate of the precision of the experiment. Accordingly, it
was decided to use a regression approach, using School Attended to
create a set of dummy variables. It was further decided to use Sex,
Ethnic Group Membership, and Rank in Class as covariates in analyses of
SAT scores in order to increase further the precision of the group
comparisons. For this purpose, Rank in Class was analyzed by calculating
regression weights not only for the observed values but also for the
second, third, and fourth powers of the ranks. Orthogonal polynomials
were used in determining the weights for the higher order variables
because it was expected that the intercorrelations of the four variables
created by computing successive powers would be high. Ethnic Group
Membership was so coded as to provide the following categories: Black-White;
Hispanic-White; Other-White. Students who did not report Sex, Rank in
Class, or Ethnic Group Membership were not included in the analysis of
means.
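
The coding of predictors described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the report does not say how the orthogonal polynomials or the school dummies were computed, Rank in Class is assumed to be coded 1 to 5 by fifths, and all function and school names are hypothetical. The polynomials are obtained by Gram-Schmidt orthogonalization of the successive powers of the rank codes.

```python
from fractions import Fraction


def orthogonal_polynomials(levels, degree):
    """Gram-Schmidt on the powers x, x^2, ..., x^degree over the given levels.

    Returns one column per degree; each column sums to zero and is
    orthogonal to every other column (the scaling is arbitrary).
    """
    xs = [Fraction(x) for x in levels]
    basis = [[Fraction(1)] * len(xs)]  # constant term
    for d in range(1, degree + 1):
        col = [x ** d for x in xs]
        for b in basis:
            coef = sum(c * v for c, v in zip(col, b)) / sum(v * v for v in b)
            col = [c - coef * v for c, v in zip(col, b)]
        basis.append(col)
    return basis[1:]


def dummy_code(value, categories):
    """0/1 indicators for all categories except the first (the reference)."""
    return [1 if value == c else 0 for c in categories[1:]]


ranks = [1, 2, 3, 4, 5]  # assumed coding of the five fifths
linear, quadratic, cubic, quartic = orthogonal_polynomials(ranks, 4)
print([int(v) for v in linear])
print([int(v) for v in quadratic])
print(dummy_code("school_B", ["school_A", "school_B", "school_C"]))
```

The raw powers of the rank codes are very highly intercorrelated, whereas the orthogonalized columns are uncorrelated over the five levels, which is the reason the text gives for using orthogonal polynomials.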

The foregoing analysis plan was applied to Form A and Form B of
SAT-verbal separately for Part 1, Part 2, and Total scores, yielding six
analyses. In addition, the study design made it possible to perform two
analyses using Part 2 scores on Form A as the dependent variable with
Part 1 scores as an additional covariate. Finally, one analysis of
mean scores was made for the students taking the College Board Chemistry
Test. In this analysis, only School, Sex, and Rank in Class were used
as covariates, because less than 10% of the Chemistry sample were members
of ethnic groups other than White, and because 97 of the 2,306 members
of the Chemistry sample did not report ethnic group membership.

Effect of Directions on Mean Formula Scores: Findings

A useful preliminary survey of the results of this study can be
obtained by considering the means and standard deviations of Rights and
Formula scores for part and total scores shown in Table 2. Within each of
the seven sets of results, the spiralling design yields mean scores that
may be compared directly with each other.

When Rights scoring is used, Rights directions consistently yield
higher mean scores than do Formula directions. This result is certainly
to be expected on logical grounds because under Rights directions the
student's optimal strategy is to answer every item. However, the effect
of the directions on the mean Rights scores is relatively small. Only for
Total scores on the tests does the difference in Rights scores exceed one
raw score point. It is also noted that Rights scores obtained when Formula
directions were used tend to have somewhat smaller standard deviations
than Rights scores obtained when Rights directions were used.






Table 2

Descriptive Statistics on Rights and Exact (Unrounded) Formula Scores for Part and Total
Scores on SAT-verbal and for Chemistry Total Scores for Each Subgroup

                     Number                              Rights Score^a    Formula Score^a
Test   Form  Part    of Items  Group  Directions    N     Mean    S.D.     Mean         S.D.
SAT-V  A     1       45        1      R          1026    22.22    6.93    17.11         8.30
"      "     "       "         2      R          1068    22.14    6.71    17.00         [illegible]
"      "     "       "         1+2    R          2094    22.18    6.82    17.06         8.19
"      "     "       "         3      F          1054    21.80    6.85    17.22         8.19
"      "     "       "         4      F          1038    21.50    6.51    [illegible]   7.78
"      "     "       "         3+4    F          2092    21.65    6.69    [illegible]   [illegible]
"      B     "       "         5      R          1040    21.02    7.09    [illegible]   [illegible]
"      "     "       "         6      F          1034    20.37    6.94    [illegible]   [illegible]
SAT-V  A     2       40        1      R          1026    18.24    6.42    13.18         7.81
"      "     "       "         3      R          1054    18.23    6.33    [illegible]   [illegible]
"      "     "       "         1+3    R          2080    18.24    6.37    13.25         7.70
"      "     "       "         2      F          1068    17.45    6.08    12.88         7.39
"      "     "       "         4      F          1038    17.48    5.97    12.84         7.26
"      "     "       "         2+4    F          2106    17.46    6.03    12.86         7.32
"      B     "       "         5      R          1040    19.28    6.62    [illegible]   [illegible]
"      "     "       "         6      F          1034    18.67    6.52    14.26         8.01
SAT-V  A     Total   85        1      R          1026    40.4?   12.75    [illegible]  15.36
"      "     "       "         4      F          1038    38.99   11.90    29.70        14.36
"      B     "       "         5      R          1040    40.30   13.07    30.08        15.82
"      "     "       "         6      F          1034    39.03   12.86    29.76        15.68
Chem.        Total   90        7      R          1151    34.50   10.74    22.06        13.03
"            "       "         8      F          1155    32.21   10.26    21.52        11.90

^a Means and standard deviations for which the scoring is consistent with the directions appear in italics.



To some extent, the fact that the data were collected under
experimental conditions rather than operational conditions may have
resulted in smaller effects for differences in directions. In particular,
it seems plausible that students would have less incentive to make random
responses when the tests are given under Rights directions under the
experimental conditions than under conditions of actual (i.e., formal,
operational) testing.

The results obtained when the tests were scored by Formula provide 
a useful preliminary indication of the extent to which the use of Formula 
scoring removes the effect of different directions. This preliminary 
analysis suggests that the use of Formula scores reduces but does not 
eliminate the effect of different directions on the mean scores. With 
respect to the standard deviation of scores, the use of Formula scoring 
does not seem to have any consistent effect on the relative size of the 
standard deviations. Formula directions tend to yield slightly smaller 
standard deviations than Rights directions for Formula scoring, just as 
they did for Rights scoring. The problem of standard deviations will 
be considered again, when equating methods are applied to the data. 
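
As a reminder of what the two scoring methods compute, the sketch below
contrasts a Rights score with a Formula score. It assumes the conventional
correction for guessing, R - W/(k - 1), with k = 5 answer choices per item;
the exact formula and the number of choices are not stated in this section,
so both are assumptions, and the counts in the example are hypothetical.

```python
from fractions import Fraction


def rights_score(rights, wrongs):
    """Rights score: the number of items answered correctly."""
    return rights


def formula_score(rights, wrongs, n_choices=5):
    """Formula score under the assumed correction R - W/(k - 1).
    Omitted and Not Reached items contribute nothing to either score."""
    return rights - Fraction(wrongs, n_choices - 1)


def expected_gain_from_blind_guess(n_choices=5):
    """Expected change in Formula score from one purely random response."""
    p_right = Fraction(1, n_choices)
    penalty = Fraction(1, n_choices - 1)
    return p_right * 1 - (1 - p_right) * penalty


# Hypothetical examinee on a 45-item part: 22 right, 20 wrong, 3 unanswered.
print(rights_score(22, 20), float(formula_score(22, 20)))
print(expected_gain_from_blind_guess())
```

Under this assumed formula a blind guess has an expected value of zero, which
is why Formula scoring is expected to reduce, though not necessarily remove,
the advantage that Rights directions give to examinees who answer every item.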

Application of regression methods to the data obtained using
Formula scoring provides more precise estimates of the effects of
differences in directions on mean Formula scores and also provides
significance tests for the observed differences. Results for nine analyses
based on all members of the designated groups who had complete data on
the covariates are presented in Table 3. Of the eight analyses performed






Table 3

Effect of Differences in Directions on Formula Scores: All Students^a

                                                              Group(s)    Group(s)
                                                              Tested      Tested
      Dependent Variable                                      Using       Using           t-test Results
                     Number                                   Rights      Formula               Adjusted
Test   Form  Part(s) of Items  Covariates                     Directions  Directions     N      Difference^b   t^c
SAT-V  A     1       45        School, Sex, Rank, Ethnic Group    1+2      3+4         4013     +0.03         0.17
"      B     1       45        "                                  5        6           1986     +0.26         0.90
"      A     2       40        "                                  1+3      2+4         4013     +0.49         2.63**
"      B     2       40        "                                  5        6           1986     +0.24         0.88
"      A     1+2     85        "                                  1        4           1985     +0.82         1.64
"      B     1+2     85        "                                  5        6           1986     +0.50         0.97
"      A     2       40        Sch, Sex, Rank, EthnicGp, Part 1   1        2           2008     +0.29         1.49
"      A     2       40        "                                  3        4           2005     +0.31         1.71
Chem.                90        School, Sex, Rank                  7        8           2262     -0.00        -0.00

^a Students with missing data on any covariate are excluded.

^b In all analyses, a plus sign indicates that students tested using Rights directions earned
higher mean adjusted scores. A minus sign indicates that students tested using Formula
directions earned higher mean adjusted scores.

^c Significance level, using two-tailed tests:
** p < .01






for SAT-verbal, all show that students tested using Rights directions
have a somewhat higher mean Formula score than students tested using
Formula directions. However, only one of the eight differences attains
statistical significance. In the two analyses using Part 1 scores as
an additional covariate, the adjusted differences on the 40-item Part 2
were approximately three-tenths of a raw score point. The two results
for Total Formula scores on the 85-item SAT-verbal have an average
value of about two-thirds of a raw score point. A difference of
two-thirds of a raw score point on Form A of SAT-verbal would be equivalent
to about 5 scaled score points (since the slope parameter for this form
of SAT-verbal is 7.2588), an amount which is more likely than not to be
inflated simply because of the greater speededness, and consequent lower
mean scores, in a Formula-directed test. But even if this finding were
taken at face value in support of the Differential Effect Hypothesis,
it must be emphasized that the significance results support the Invariance
Hypothesis. From a practical standpoint, the results suggest that assuming
the Invariance Hypothesis for equating Rights scores to Formula scores
would be unlikely to result in a serious discontinuity in the scale,
at least in the vicinity of the mean.
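
The conversion from raw score points to scaled score points in the paragraph
above can be retraced with a short calculation. The sketch assumes a linear
raw-to-scale conversion, so that a difference between two raw scores is
simply multiplied by the slope, and it takes the slope to be 7.2588, the
value implied by the statement that two-thirds of a raw point is about 5
scaled points. The intercept is not given here and cancels out of any
difference.

```python
# Assumed slope of the linear raw-to-scale conversion for Form A of SAT-verbal.
SLOPE_FORM_A = 7.2588


def scaled_difference(raw_difference, slope=SLOPE_FORM_A):
    """Scaled-score equivalent of a raw-score difference under a linear
    conversion scaled = slope * raw + intercept (the intercept cancels)."""
    return slope * raw_difference


# Adjusted differences for Total Formula scores on Forms A and B (Table 3).
total_differences = [0.82, 0.50]
mean_raw_difference = sum(total_differences) / len(total_differences)

print(round(mean_raw_difference, 2))
print(round(scaled_difference(mean_raw_difference), 1))
```

The mean of the two Total-score differences is 0.66 raw points, and applying
the Form A slope to it gives a value just under 5 scaled score points, in
line with the figure quoted in the text.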

Results for the single analysis of Chemistry Test data show a
difference in adjusted means almost exactly equal to zero. Perhaps the
most tenable interpretation of this result is that it is essentially
similar to the results for SAT-verbal. The specific outcome for Chemistry
suggests that, if anything, the Invariance Hypothesis is more appropriate




for subject-matter tests than for aptitude measures.

The data used in the analyses in which Part 1 scores were used as a
covariate may also be analyzed using Part 1 as a stratification variable,
assuming that the outcome is not affected by the fact that one of the
two groups in each comparison was given different directions for the two
sections and the other was not. When stratification is based on Part 1
scores, the results throw light on the question of whether the effect
of differences in scoring directions is related to a student's ability
level. It is also possible to divide the sample into two groups on the
basis of Items Not Attempted in Part 1 administered under Formula
directions. This analysis provides a comparison of results for students
who chose not to answer a substantial number of items (9 or more) with
results for students who chose not to answer relatively few items (8 or
fewer). As shown in Table 4, there is no apparent trend in the size of
the effect when students are stratified on Rights scores, although
students in the highest stratum show the largest effect. When
stratification is on Formula scores, there is a trend for effects to be
larger for students in the lower strata. When the two results are
considered together, they suggest that there is no consistent trend for
ability level to be related to the size of the effect of different
directions. Results for the groups stratified on the basis of Items Not
Attempted, when the test used for stratification was administered under
Formula directions, are quite similar for students having 9 or more
Nonattempts and for those having fewer than 9 Nonattempts. This result
would be






Table 4

Effect of Differences in Directions on Formula Scores Earned on Part 2 of SAT-verbal for Groups
Stratified on the Basis of Rights Scores, Formula Scores, and Nonattempts on Part 1^a

                                  Group Tested   Group Tested        t-test Results
Stratification                    Using Rights   Using Formula             Adjusted
Variable        Interval          Directions     Directions         N      Difference^b   t^c
Rights Score    25 and higher     1              2                 734     +0.75          +1.69
"               20-24             "              "                 600     +0.00          +0.00
"               15-19             "              "                 412     +0.42          +0.91
"               14 and lower      "              "                 262     +0.34          +0.63
"               Total             "              "                2008     +0.38          +1.41
Formula Score   21 and higher     3              4                 665     +0.27          +0.63
"               15-20             "              "                 613     +0.29          +0.82
"               9-14              "              "                 447     +0.37          +0.89
"               8 and lower       "              "                 280     +0.72          +1.55
"               Total             "              "                2005     +0.60          +2.36*
Items Not
Attempted       9 and higher      3              4                 475     +0.67          +1.32
"               8 and lower       "              "                1530     +0.59          +2.00*
"               Total             "              "                2005     +0.60          +2.36*

^a Covariates used in all analyses were: School, Sex, Rank in Class, and Ethnic Group.
Students with missing data on any covariate were excluded.

^b In all analyses, a plus sign indicates that students tested using Rights directions earned
higher mean adjusted scores. A minus sign indicates that students tested using Formula
directions earned higher mean adjusted scores.

^c Significance levels, using two-tailed tests:
* p < .05






plausible if the Invariance Hypothesis is warranted. 

In addition to the analyses based on the total group, two regression
studies of SAT-verbal were performed using only Black students (Table 5).
In these analyses, it was possible to combine data for two groups who had
Rights directions on Form A and for two groups who had Formula directions
on Form A. In both analyses, the differences were small and were not
statistically significant. As it happened, Black students earned
slightly higher Formula scores when tested using Formula directions than
when tested using Rights directions. Although these results are opposite
in direction to those found for the corresponding total groups, it seems
probable that the difference is attributable to sampling fluctuations
and that attempts to interpret this difference would not be warranted.

The study design provided that two of the groups taking SAT-verbal
would have Rights directions on one part and Formula directions on the
other part. Thus, it was possible to compare the performance on Part 2
of students who had different directions on the two parts with the
performance of students who had the same directions on both parts. As
shown in Table 6, results based on all students show that performance
on Part 2 for groups that had different directions on the two parts of
the test is remarkably similar to the results for groups that had the
same directions on the two parts of the test, indicating that students
at this level are quite capable of changing their guessing strategies
in the middle of an administration in accordance with changes in
directions to guess, as was assumed in the discussion of Table 4. It was






Table 5

Effect of Differences in Directions on Formula Scores: Black Students^a

                                                   Group(s)     Group(s)
                                                   Tested       Tested
      Dependent Variable                           Using        Using            t-test Results
                                                   Rights       Formula                Adjusted
Test   Form  Part(s)  Items  Covariates            Directions   Directions      N      Difference^b    t
SAT-V  A     1        45     School, Sex, Rank     1+2          3+4            760     -0.23          -0.52
"      A     2        40     "                     1+3          2+4            760     -0.10          -0.27

^a Students with missing data on any covariate were excluded.

^b In all analyses, a plus sign indicates that students tested using Rights directions earned
higher mean adjusted scores. A minus sign indicates that students tested using Formula
directions earned higher mean adjusted scores.






Table 6

Effect of Directions on Part 1 on Scores Earned on Part 2^a

                                                              Group Tested    Group Tested
                                                              Using Rights    Using Formula      t-test Results
   Dependent Variable                                         Directions on   Directions on   No. of   Adjusted
Part  Directions  Score     Covariates                        Part 1          Part 1          Cases    Difference    t

Results Based on All Students in Designated Groups^b
2     Rights      Formula   School, Sex, Rank, Ethnic Group   1               3               1992     -0.13        -0.50
2     Formula     Formula   "                                 2               4               2021     -0.11        -0.42

Results Based on Black Students in Designated Groups^b
2     Rights      Formula   School, Sex                       1               3                389     -0.97        -1.82
2     Formula     Formula   "                                 2               4                371     -0.23        -0.42

^a In this table, a minus sign indicates that students who had the same directions on both
parts of the test earned lower mean adjusted scores on Part 2 than did those students
who had different directions on the two parts of the test.

^b Students with missing data on any covariate are excluded.






found that Black students who were asked to shift from Formula directions
on Part 1 to Rights directions on Part 2 earned somewhat higher Part 2
scores than Black students who had Rights directions for both parts,
although the difference did not attain statistical significance. For
the other analysis of shift in directions, the results for Black students
were quite similar to those for all students.

Effect of Directions on Nonresponse

Four types of scores, other than Rights and Formula scores, were
subjected to separate investigation in this study. These are the number
of items Omitted, the number of items Not Reached, the number of items
Not Attempted (the sum of the number Omitted and the number Not Reached),
and an arbitrarily constituted measure of Guessing, to be discussed in
the next section. In defining these variables it is assumed that all
items left unanswered prior to the last item reached were in fact
considered, but intentionally left unmarked, presumably for reasons of
insufficient knowledge, ability, skill, etc. These are the Omitted items.
All items left unmarked beyond the last item marked are presumed to be
those that the examinee has not had time to consider. These are the
items Not Reached.
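
The counting rule just described can be sketched as a small routine. The
representation of an answer sheet as a list with `None` for an unmarked item
is an assumption made for illustration, as are the function name and the
example responses.

```python
def classify_nonresponse(responses):
    """Count Omitted, Not Reached, and Not Attempted items.

    `responses` holds one entry per item in test order; None means the
    item was left unmarked (hypothetical representation).
    """
    # Position of the last marked item; -1 if nothing at all was marked.
    last_marked = max(
        (i for i, r in enumerate(responses) if r is not None), default=-1
    )
    # Unmarked items before the last marked item are presumed considered
    # and skipped: these are the Omitted items.
    omitted = sum(1 for r in responses[: last_marked + 1] if r is None)
    # Unmarked items after the last marked item are presumed Not Reached.
    not_reached = len(responses) - (last_marked + 1)
    return {
        "omitted": omitted,
        "not_reached": not_reached,
        "not_attempted": omitted + not_reached,
    }


# Hypothetical six-item answer sheet: item 2 skipped, items 5 and 6 unreached.
print(classify_nonresponse(["A", None, "C", "B", None, None]))
```

Note that the rule depends only on the position of the last marked item, so an
examinee who skips ahead and marks the final item has every earlier blank
counted as Omitted rather than Not Reached.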

In order to determine whether there was any relationship between
score level and ethnic group and the number of items Omitted, Not Reached,
Not Attempted, and Guessed, the tabulations in Tables 7-12 were prepared,
one for each of the four groups taking Form A of SAT-verbal, and one for
each of the two groups taking Form B of SAT-verbal. Each of these six



Table 7

Number of Items Omitted, Not Reached, Not Attempted, and Guessed on Part 2
of SAT-V, Given with Rights Directions, for White (W), Black (B), and Black
plus Hispanic (B+H) Students, Stratified by Rights Scores on Part 1

(Based on Group 1)

Score          Ethnic                  Omitted       Not Reached   Not Attempted^a  Guessing Index (W-O)
Interval       Group   N               Mean    SD    Mean    SD    Mean    SD       Mean     SD
25 and higher  W       316             0.25   1.20   0.34   1.29   0.59   1.78      15.03    5.35
               B       [illegible]     0.22   0.55   0.33   1.03   0.56   1.10      17.89    4.59
               B+H     23              0.26   0.54   0.26   0.92   0.52   0.99      18.22    5.13
20 - 24        W       223             0.66   2.10   0.58   1.62   1.25   2.71      20.20    5.68
               B       [illegible]     1.07   2.67   0.80   1.98   1.87   3.32      20.78    6.33
               B+H     52              0.96   2.53   0.83   2.02   1.79   3.25      20.69    6.09
15 - 19        W       120             0.80   2.73   1.05   2.19   1.85   3.55      22.22    6.63
               B       [illegible]     0.48   2.45   1.61   3.21   2.08   4.59      24.38    6.74
               B+H     79              0.38   2.16   1.34   2.92   1.72   4.13      24.75    6.23
14 and lower   W       35              2.40   5.80   1.37   2.74   3.77   6.44      21.49   11.67
               B       77              1.39   3.46   3.01   4.77   4.40   7.12      24.53    9.41
               B+H     90              1.30   3.39   2.87   4.59   4.26   6.78      24.29    9.16
Total          W       694             0.59   2.29   0.59   1.70   1.18   2.93      18.26    6.84
               B       202             0.94   2.85   1.85   3.69   2.78   5.46      23.0?    7.92
               B+H     244             0.86   2.70   1.69   3.51   2.56   5.14      23.10    7.63

^a Omits plus Not Reached



Table 8

Number of Items Omitted, Not Reached, Not Attempted, and Guessed on Part 2
of SAT-V, Given with Formula Directions, for White (W), Black (B), and Black
plus Hispanic (B+H) Students, Stratified by Rights Scores on Part 1

(Based on Group 2)

Score          Ethnic        Omitted       Not Reached                Not Attempted^a            Guessing Index (W-O)
Interval       Group   N     Mean    SD    Mean         SD            Mean         SD            Mean     SD
25 and higher  W       341   3.32   3.94   0.43         1.34          3.75         [illegible]   10.26    8.01
               B       26    3.50   4.18   0.69         2.11          4.19         [illegible]   12.65    8.99
               B+H     30    3.43   4.17   0.77         2.13          4.20         4.96          12.43    8.48
20 - 24        W       233   3.47   4.3?   [illegible]  [illegible]   [illegible]  5.19          14.79    8.98
               B       43    2.44   4.50   2.05         2.86          4.49         6.03          16.58    9.10
               B+H     56    3.21   5.20   1.80         2.69          5.02         6.27          15.64   10.30
15 - 19        W       120   3.19   4.71   1.55         2.84          4.74         5.73          17.93    9.97
               B       57    1.86   3.25   1.65         3.22          3.51         5.35          22.25    7.95
               B+H     70    2.03   3.55   1.67         3.14          3.70         5.35          21.56    8.02
14 and lower   W       49    3.45   5.20   2.53         3.38          5.98         6.25          19.45   10.53
               B       65    1.52   3.60   2.40         3.59          3.92         5.44          24.02    8.71
               B+H     78    1.50   3.71   2.18         3.39          3.68         5.38          24.27    8.60
Total          W       743   3.36   4.29   0.91         2.11          4.27         5.00          13.53    9.41
               B       191   2.10   3.83   1.86         3.18          3.96         5.43          20.27    9.50
               B+H     234   2.32   4.18   1.76         3.03          4.07         5.54          19.88    9.81

^a Omits plus Not Reached



Table 9

Number of Items Omitted, Not Reached, Not Attempted, and Guessed on Part 2
of SAT-V, Given with Rights Directions, for White (W), Black (B), and Black
plus Hispanic (B+H) Students, Stratified by Formula Scores on Part 1

(Based on Group 3)

Score          Ethnic                  Omitted       Not Reached   Not Attempted^a  Guessing Index (W-O)
Interval       Group   N               Mean    SD    Mean    SD    Mean    SD       Mean     SD
21 and higher  W       [illegible]     0.54   1.51   0.31   1.17   0.85   2.02      14.51    5.20
               B       20              0.50   0.89   0.55   1.50   1.05   1.93      16.15    3.92
               B+H     27              0.78   1.85   0.59   1.58   1.37   2.42      15.52    3.93
15 - 20        W       217             0.87   2.47   1.04   2.26   1.90   3.54      18.77    5.83
               B       42              1.26   2.55   1.24   2.49   2.50   3.55      19.64    5.50
               B+H     50              1.12   2.39   1.06   2.32   2.18   3.35      19.64    5.29
9 - 14         W       [illegible]     1.39   3.15   1.14   2.50   2.54   4.15      21.05    7.09
               B       60              1.78   4.24   1.65   2.59   3.43   5.18      21.78    8.53
               B+H     77              1.84   4.42   1.70   2.75   3.55   5.30      21.53    8.78
8 and lower    W       [illegible]     0.91   ?.39   0.93   3.03   1.84   4.21      24.38    5.59
               B       77              1.25   3.65   1.99   3.48   3.23   5.78      25.47    8.48
               B+H     90              1.29   3.79   2.31   5.22   3.60   7.00      25.01    9.30
Total          W       722             0.83   2.28   0.73   2.01   1.55   3.23      17.67    6.60
               B       199             1.34   3.47   1.58   2.89   2.92   4.93      22.19    8.14
               B+H     244             1.37   3.60   1.67   3.75   3.05   5.50      21.76    8.50

^a Omits plus Not Reached



Table 10

Number of Items Omitted, Not Reached, Not Attempted, and Guessed on Part 2
of SAT-V, Given with Formula Directions, for White (W), Black (B), and Black
plus Hispanic (B+H) Students, Stratified by Formula Scores on Part 1

(Based on Group 4)

Score          Ethnic        Omitted       Not Reached   Not Attempted^a            Guessing Index (W-O)
Interval       Group   N     Mean    SD    Mean    SD    Mean         SD            Mean          SD
21 and higher  W       277   2.79   3.32   0.55   1.57   3.34         3.63          10.44         7.09
               B       12    2.33   3.37   0.92   1.73   3.23         [illegible]   [illegible]   7.45
               B+H     18    2.06   3.13   1.11   2.08   3.17         4.03          14.33         8.15
15 - 20        W       262   3.61   4.55   1.02   2.18   4.63         5.23          13.93         8.87
               B       45    2.44   3.22   1.18   2.26   [illegible]  [illegible]   [illegible]   6.54
               B+H     58    1.98   2.98   1.09   2.12   3.07         3.81          18.10         6.23
9 - 14         W       128   2.73   3.89   1.80   2.87   4.54         5.31          18.33         8.35
               B       70    2.06   3.56   2.10   4.26   4.16         6.86          21.49         9.16
               B+H     82    1.96   3.44   2.02   4.33   3.99         6.91          21.68         9.09
8 and lower    W       47    1.87   3.27   1.60   2.85   3.47         4.88          23.13         8.20
               B       64    0.83   2.17   1.75   3.39   2.58         4.19          26.30         6.21
               B+H     73    0.86   2.11   1.64   3.24   2.51         4.01          26.23         6.11
Total          W       714   3.02   3.93   1.02   2.21   4.04         4.70          13.97         8.87
               B       191   1.75   3.11   1.69   3.45   3.44         5.33          21.53         8.55
               B+H     231   1.63   2.96   1.60   3.39   3.23         5.20          21.65         8.33

^a Omits plus Not Reached



Table 11

Number of Items Omitted, Not Reached, Not Attempted, and Guessed on Part 2
of SAT-V, Given with Rights Directions, for White (W), Black (B), and Black
plus Hispanic (B+H) Students, Stratified by Rights Scores on Part 1

(Based on Group 5)

Score          Ethnic        Omitted       Not Reached   Not Attempted^a  Guessing Index (W-O)
Interval       Group   N     Mean    SD    Mean    SD    Mean    SD       Mean     SD
25 and higher  W       278   0.29   1.10   0.18   0.79   0.47   1.48      13.40    5.19
               B       19    0.79   2.76   0.16   0.69   0.95   2.92      16.74    7.13
               B+H     28    0.86   2.68   0.14   0.59   1.00   2.88      16.39    6.29
20 - 24        W       217   0.41   1.36   0.37   1.19   0.77   1.96      18.61    5.14
               B       28    0.36   0.87   0.54   1.26   0.89   1.73      19.86    4.66
               B+H     34    0.29   0.80   0.44   1.16   0.74   1.60      19.85    4.25
15 - 19        W       146   0.78   1.95   1.09   2.23   1.87   3.24      19.89    6.04
               B       59    0.53   1.79   0.95   2.23   1.47   2.88      23.24    6.72
               B+H     78    0.69   1.92   1.13   2.35   1.82   3.30      22.86    6.69
14 and lower   W       67    1.22   3.46   1.93   3.67   3.15   5.46      22.82    8.82
               B       91    0.47   1.29   1.27   2.78   1.75   2.94      26.90    4.89
               B+H     105   0.50   1.32   1.22   2.66   1.71   2.92      26.70    4.90
Total          W       708   0.51   1.74   0.59   1.80   1.11   2.77      17.23    6.64
               B       197   0.50   1.59   0.96   2.33   1.47   2.78      23.82    6.60
               B+H     245   0.57   1.68   0.96   2.26   1.53   2.91      23.35    6.58

^a Omits plus Not Reached



Table 12

Number of Items Omitted, Not Reached, Not Attempted, and Guessed on Part 2
of SAT-V, Given with Formula Directions, for White (W), Black (B), and Black
plus Hispanic (B+H) Students, Stratified by Formula Scores on Part 1

(Based on Group 6)

Score          Ethnic                  Omitted       Not Reached   Not Attempted^a  Guessing Index (W-O)
Interval       Group   N               Mean    SD    Mean    SD    Mean    SD       Mean     SD
21 and higher  W       245             2.53   2.91   0.56   1.39   3.09   3.26       8.46    6.33
               B       14              2.86   3.11   0.36   0.93   3.21   3.09       9.43    6.24
               B+H     [illegible]     2.50   2.87   0.44   1.04   2.94   2.82      10.28    5.81
15 - 20        W       236             3.3?   4.01   0.78   1.82   4.11   4.90      12.87    [illegible]
               B       22              4.14   4.68   0.86   1.61   5.00   4.90      12.82    9.63
               B+H     [illegible]     3.91   4.62   0.91   1.75   4.81   4.70      13.09    9.41
9 - 14         W       151             2.74   3.95   1.24   2.38   3.98   5.22      16.64    9.1?
               B       56              2.38   4.81   2.05   3.01   4.43   6.36      19.14   10.44
               B+H     72              2.61   4.89   1.85   2.80   4.46   6.26      18.83   10.49
8 and lower    W       78              1.73   3.58   1.21   2.50   2.94   4.96      22.12    9.10
               B       100             1.60   3.12   1.74   2.94   3.34   4.39      24.44    6.65
               B+H     109             1.61   3.24   1.68   2.87   3.28   4.40      24.46    6.91
Total          W       710             2.75   ?.63   0.85   1.93   3.60   4.50      13.17    9.07
               B       192             2.21   3.93   1.63   2.77   3.84   5.03      20.47    9.57
               B+H     231             2.31   4.05   1.53   2.64   3.84   5.02      20.03    9.73

^a Omits plus Not Reached

groups has been stratified on the scores earned on Part 1 of the SAT-
verbal, in categories of Rights scores: 25 and higher, 20-24, 15-19,
and 14 and lower (and total) for those groups (Groups 1, 2, and 5) for
whom Part 1 was administered under Rights directions; and in categories
of Formula scores: 21 and higher, 15-20, 9-14, and 8 and lower (and total)
for those groups (Groups 3, 4, and 6) for whom Part 1 was administered
under Formula directions. The means and standard deviations of items
Omitted, Not Reached, Not Attempted, and Guessed were determined from
Part 2 data, separately for Whites, Blacks, and Blacks and Hispanics
combined.

The means of the Omitted items seem to show uneven, but clearly
different, trends for the two sets of directions. There is a slightly
greater tendency for lower-scoring groups than for higher-scoring groups
to omit items in Rights-directed test situations, a characteristic
which is shared by all three ethnic groups, and a generally smaller
tendency for lower-scoring groups than for higher-scoring groups to
omit items in Formula-directed test situations. In the first (Rights-
directed) situation, it is probable that lower-scoring groups do
not follow the sense of the directions given to them as well as their
higher-scoring counterparts do, and guess less often than they should
under these directions. In the second (Formula-directed) situation,
the same kind of explanation is reasonable: that lower-scoring groups
again fail to follow the sense of the directions given to them and
guess more often than they should. This tendency may also arise from






a greater failure (for them than for higher-scoring groups) to assess
their own degree of knowledge and competence on the items. This
observation (and speculation) supports the assertions attributed to Ebel
(1968) and Lord (1977), supported also by data observed in item analyses,
alluded to earlier, that lower-scoring examinees guess more, not less,
than higher-scoring students do (Ebel), and more than they should (Lord)
on Formula-scored tests.

Considering the data by ethnic group within strata, there appear to
be no observable trends to report. However, there are differences in mean
Omits and in standard deviations of Omits with respect to the nature of
the directions given for guessing. Those groups who were given Formula
(restrictive) instructions with respect to guessing in Part 2 of the
test (Groups 2, 4, and 6) show much greater average numbers and dispersions
of omitted items than those who were given Rights (permissive) directions
in Part 2 (Groups 1, 3, and 5).

The means and standard deviations of items Not Reached do show a clear
progression across ability groups, with examinees in the lower ability
groups showing greater average numbers of unreached items than those in
the higher ability groups. The Not Reached count is often used as an index
of speededness and is known to correlate negatively with score. These
results, therefore, are not unexpected and lend added support to the
reasonableness of the data. What is of special interest here is the fact
that the relationships between ability and number of items Not Reached are
not spurious, in the sense that the correlation might be coerced by the
constraint




of the total number of items. The correlation represented here is that
between ability, as measured by the score on Part 1, and the count of
items Not Reached independently observed on Part 2.

The same progression of the counts of Not Reached on Part 2, as a
function of score level on Part 1, is observed for all three ethnic groups:
White, Black, and Black plus Hispanic. There are also some differences
(but not quite as clear or sharp, probably because of the small samples
of Blacks and Hispanics) among the ethnic groups within strata, with
the numbers of Not Reached items largest for Blacks, intermediate for the
Black-plus-Hispanic group, and smallest for Whites. These differences
are probably attributable in part to differences in score level, even
within the strata. Because the strata are relatively broad, it should be
noted that even within strata the groups are not precisely matched on
ability. As a result it is possible that ability differences within strata
may have affected the results.

We also observe that there are small but persistent differences in
the numbers of Not Reached items between the tests administered under
Rights directions and those administered under Formula directions, with
those tested under Formula directions tending to show a larger number of
items Not Reached than those tested under Rights directions. There are
three reasonable, and not incompatible, explanations for this. One is
that Formula directions require more time on the part of the examinee;
in view of the penalty for incorrect responses, the examinee is obliged
to consider and weigh his (her) responses a little






more carefully than if the answer sheet were to be scored Rights Only.
A second is that students may have engaged in blind guessing under Rights
directions near the end of the test, causing an underestimate of the
number of items not considered. The third is that the distinction
between Omitted and Not Reached items is probably somewhat contaminated,
even if not seriously so. The distinction between these two types of
Nonattempts assumes that the examinee progresses systematically through
the test, responding or omitting as he or she goes along, without skipping
or returning to an item considered earlier. Indeed, it is entirely likely
that on occasion an examinee will omit an item with the intention of
returning to it if time permits, but then time does not always permit.
It is also likely that some items classified as Not Reached are in fact
considered and intentionally omitted. These explanations may well account
for the fact that the data on the Not Reached items show differences
between the two types of directions, not as clearly, to be sure, but in
the same direction as shown by the counts of Omitted items.

Effect of Directions on Guessing

Considerable thought was given to investigating the extent to which
examinees at different score levels guess the answers to the items, and
a highly simplified measure of guessing, namely W-O, where W and O are,
respectively, a count of the number of items answered incorrectly and
the number of items omitted, was considered for this purpose. (Note
again that the number of items omitted is taken to include only those
items presumed to have been examined, considered, and consciously skipped,






apart from those that the examinees are presumed not to have reached within
the time limit, those near the end of the test. The items Omitted are
the unmarked items followed by one or more marked items; the items Not
Reached are the unmarked items that are not followed by any marked items.)
The justification for the proposed index, W-O, is that W includes all items
for which it may safely be assumed that the student had less than complete
knowledge. (W may also include other errors, for example, those arising
from failure to understand the directions and from carelessness in marking
the answer sheet.) Assuming that the student had no clerical errors in
responding, then, it is presumed that any student who responded without
complete knowledge did so with at least some degree of guesswork. The
subtraction for Omits was introduced as indicating a conscious suppression
of any tendency to guess. Because some wrong answers arise from partial
or incorrect knowledge, the index cannot be regarded as a pure measure of
guessing tendency. However, for students of equal ability, W-O should be
a useful indicator of relative tendency to guess.
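
The counting rules just described can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `guessing_index`, the use of `None` for an unmarked item, and the sample answer key are hypothetical and do not come from the study.

```python
from typing import Dict, Optional, Sequence


def guessing_index(responses: Sequence[Optional[str]],
                   key: Sequence[str]) -> Dict[str, int]:
    """Count Rights, Wrongs, Omitted, and Not Reached items, and W - O.

    An unmarked item is represented by None (an assumption of this sketch).
    Omitted     = unmarked items followed by at least one marked item.
    Not Reached = unmarked items not followed by any marked item.
    """
    marked_positions = [i for i, r in enumerate(responses) if r is not None]
    last_marked = marked_positions[-1] if marked_positions else -1

    rights = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    # Unmarked items up to and including the last marked position are Omits.
    omitted = sum(1 for r in responses[: last_marked + 1] if r is None)
    # Everything after the last marked item is treated as Not Reached.
    not_reached = len(responses) - (last_marked + 1)

    return {
        "rights": rights,
        "wrongs": wrongs,
        "omitted": omitted,
        "not_reached": not_reached,
        "w_minus_o": wrongs - omitted,
    }


# Hypothetical six-item example: item 2 is omitted, items 5 and 6 not reached.
example = guessing_index(["A", None, "C", "B", None, None],
                         ["A", "B", "D", "B", "C", "A"])
```

In this example the single Omit offsets the single Wrong, so the index is zero even though one item was answered incorrectly.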

Other investigators have also developed indices of guessing. A
review of these developments and their applications appears in Slakter
(1967). Swineford (1938), for example, reports on the development of a
measure of "gambling tendency," derived from a special administration of
the test, in which she asked the students to rate their level of confidence
on a 3-point scale (2, 3, or 4, with the rating of 4 representing high
confidence) in responding to true-false items. The formula for her index
of gambling is the following:



$$
\text{Gambling} = G = \frac{\text{Errors Marked ``4''}}{\text{Total Errors} + \tfrac{1}{2}\,\text{Omissions}} \times 100,
$$



In a later article Swineford (1941) found that the index was "independent
of the scores on the tests from which they were computed and also independent
of five mental factors [General, Spatial, Verbal, Speed, and Memory] which
have been measured by a larger battery of tests." She also found that
the "intercorrelations among the [index] scores from [four tests administered
to the experimental subjects: Paper Form Board, General Information, Word
Meaning, and Deduction] are sufficiently high to yield a multiple
correlation of .85 when all four measures are combined in a regression estimate
of the G [guessing] factor" (1941).

Ziller (1957) also offers an index of guessing, but provides no empirical
data to test its reasonableness. Unlike Swineford's index, Ziller's is
based entirely on the ordinary item responses given by the students in the
regular administration of the test, unmodified by special instructions (e.g.,
expressing degrees of confidence in the responses). Ziller's index is a
proportion of the Wrongs to the total of the Wrongs and the items Not
Attempted, in the following relationship:



$$
Z = \frac{[k/(k-1)]\,W}{[k/(k-1)]\,W + NA},
$$

where  k  = the number of options per item,

       W  = the number of incorrect responses, and

       NA = the number of Nonattempted items, which includes
            what the present authors refer to as "Omits" plus
            what they refer to as items "Not Reached."
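
As a rough illustration of how the two published indices are computed, the sketch below evaluates Swineford's G and Ziller's Z from simple counts. The function names and the sample counts are hypothetical; only the two formulas are taken from the text above.

```python
def swineford_gambling(errors_marked_4: int, total_errors: int,
                       omissions: int) -> float:
    """Swineford's index: errors marked '4' over (errors + half the omissions), times 100."""
    denominator = total_errors + 0.5 * omissions
    if denominator == 0:
        raise ValueError("index is undefined with no errors and no omissions")
    return 100.0 * errors_marked_4 / denominator


def ziller_index(wrongs: int, not_attempted: int, k: int) -> float:
    """Ziller's index: weighted Wrongs over weighted Wrongs plus Nonattempts.

    k is the number of options per item; Wrongs are weighted by k / (k - 1).
    """
    if k < 2:
        raise ValueError("an item needs at least two options")
    weighted_wrongs = (k / (k - 1)) * wrongs
    denominator = weighted_wrongs + not_attempted
    if denominator == 0:
        raise ValueError("index is undefined with no wrongs and no nonattempts")
    return weighted_wrongs / denominator


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
g = swineford_gambling(errors_marked_4=6, total_errors=10, omissions=4)  # 100 * 6 / 12
z = ziller_index(wrongs=8, not_attempted=10, k=5)                        # 10 / (10 + 10)
```

Note that Ziller's index needs no special administration, whereas Swineford's requires the confidence ratings collected with each response.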




It was expected, in considering the W-O index, that it would correlate
negatively with score level. To determine the strength of this relationship,
index scores were correlated in Groups 1 and 4 with (a) the scores
of the tests from which they were calculated and (b) the scores of the
parallel tests administered in the other half-hour session. Thus, if R,
F, and I are taken, respectively, to represent Rights scores, Formula
scores, and the index, W-O; and if the subscripts 1 and 2 are taken,
respectively, to represent measures derived from the first and second parts
of Form A, then the following correlations can be evaluated:

      Group 1; N = 1026               Group 4; N = 1038

      r_{R_1 I_1} = -.605             r_{F_1 I_1} = -.587

      r_{R_1 I_2} = -.533             r_{F_1 I_2} = -.514

      r_{R_2 I_1} = -.503             r_{F_2 I_1} = -.496

      r_{R_2 I_2} = -.676             r_{F_2 I_2} = -.602

As may be seen in the correlations given above, the index, W-O, correlates
negatively at a substantial level with ability scores. The correlations
are, as expected, higher when the index is based on the same test performance
as is the measure of ability. This is so for the obvious reason that the
ability scores (R and F) are necessarily negatively correlated with W and O,
constrained as they are by the number of items in the test, which is constant
for all examinees, and also constrained by the fact that R (or F) would
necessarily correlate more strongly with W than with O. But even when the
correlation is carried out between tests (as in r_{R_1 I_2} and r_{R_2 I_1},
for example),






there is a stronger relationship than would be ideal in a study involving 
guessing and cognitive scores. The same pattern of relationships as that 
observed between Rights scores and the index is seen in the correlations 
between Formula scores and the index. The latter correlations, however, 
are consistently smaller than the former. 

It is also noted that the correlations for Group 1 are much smaller
than would have been obtained if the students had followed the Rights
directions strictly. If they had, there would have been no items omitted
or not reached, and the correlation between Rights scores and the index,
W-O, would have been -1.00.
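
The limiting case can be checked numerically. If every one of the n items is answered, then W = n - R and O = 0, so the index W-O is an exact decreasing linear function of R. The sketch below uses hypothetical Rights scores on an 85-item test; the helper `pearson_r` is written out here so that the example has no dependencies.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


N_ITEMS = 85                      # Part 1 (45 items) plus Part 2 (40 items)
rights = [30, 42, 55, 61, 70]     # hypothetical Rights scores
wrongs = [N_ITEMS - r for r in rights]   # every item answered: W = n - R
omits = [0] * len(rights)                # strict Rights directions: no Omits
index = [w - o for w, o in zip(wrongs, omits)]

r_strict = pearson_r(rights, index)      # equals -1 up to rounding error
```

Any Omits or Not Reached items break the exact linear dependence, which is why the observed correlations for Group 1 fall well short of -1.00.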

Because the various groups tested in the experiment with SAT-verbal
were very nearly equal in ability, W-O may be used as an indicator of
the extent to which directions affected the tendency to guess. The
following table shows the mean Total scores on W-O for groups tested under
Rights directions and Formula directions.













                                                            Mean W-O
Test          Form    Group    Directions    No. of Cases   (Parts 1+2)

SAT-verbal     A        1      Rights            1026          39.28
SAT-verbal     A        4      Formula           1038          31.61

SAT-verbal     B        5      Rights            1040          39.56
SAT-verbal     B        6      Formula           1034          31.35

Chemistry               7      Rights            1151          47.24
Chemistry               8      Formula           1155          32.04


The tabulations of the "guessing score," W-O, by ability level and
ethnic group (Tables 7-12) make it clear that there is more guessing,






or at least more items Wrong, even with the Omits subtracted out, in the
lower-scoring groups than in the higher-scoring groups. Again, it should
be pointed out that the "guessing score" is necessarily correlated negatively
with ability, even on an independent set of items, since the Wrongs component
of W-O is very nearly a complement of the Rights, which is expected to show a
high correlation across tests. At the same time it should also be pointed
out that the Wrongs score, moderated by the Omits, is a reasonable measure
of guessing, since, obviously, an item marked incorrectly indicates that
the examinee does not know the correct answer to the item and is in fact
making either a random or an uneducated guess.

Reading across ethnic groups, it is clear that the Whites do less
"guessing" than either the Black or the Black-plus-Hispanic groups, even
within strata. This observation, admittedly, may be related as much to
the fact that the Wrongs score, which is the heavier component
of the W-O guessing index, is necessarily correlated negatively with the
Rights score as to any difference in guessing itself, and it is evident
in these data at least in part because of the breadth of the strata.

When comparisons are made between the two modes of instruction, it
is observed that the examinees tested under Rights directions do in fact
guess more than do the examinees tested under Formula directions. This
difference is not the same at all levels; it is much sharper and clearer
for higher-scoring students than for lower-scoring students, suggesting
that the lower-scoring students do not follow instructions as well as
higher-scoring students or as well as they should, and do not observe the






strategies for guessing as wisely as they should. 

It is also of some interest to compare the correlations between
Part 1 scores and Part 2 scores on Wrongs, Omits, and W-O scores for
Groups 1 and 4, which took the test under the same directions on both parts,
with the correlations for Groups 2 and 3, which took the test under
different directions on the two parts.



                                             Correlations between Parts 1 and 2
           Directions           No. of
Group    on the Two Parts       Cases         Wrongs       Omits        W-O

  1         Same                 1026          .768         .784        .775

  2         Different            1068          .686         .246        .538

  3         Different            1054          .695         .318        .568

  4         Same                 1038          .807         .770        .819


These results make it clear that both Wrongs and Omits yield smaller
correlations when the directions for the two parts are different than when they
are the same, as would be expected. The differences in these correlations
are sharper and clearer for Omits than for Wrongs, suggesting that a simple
count of the Omits might be an even better indicator of guessing than W-O
because it would not be affected by wrong answers attributable to partial
information and misinformation, which tend to impair the clarity of the index.

It would appear that it is not likely that one can develop a
satisfactory index of guessing that would be derived solely from the responses to
the test itself. Although it is quite reasonable to believe that there
should be no correlational relationship between the tendency to guess and
cognitive ability in the abstract sense, there is good reason to believe
that guessing does affect cognitive test scores and therefore would
correlate with them (L. R. Tucker, personal communication), as it has in




the data cited above. The question, however, is just how strong that 
relationship should be. The data resulting from this study might con- 
ceivably be useful in studying the role of guessing in Rights- and Formula- 
scored tests, but until we can be more confident of the validity of the 
guessing index itself, we cannot be confident of the usefulness of the 
actual data resulting from a study of the correlation of the index with 
test data. 

Effect of Directions and Scoring on Reliability and Parallelism 

The design of the study permitted close examination of the correla- 
tion between the two parallel sections of the SAT-verbal under all possible 
combinations of administration and scoring, and also permitted the
examination of test-retest and internal consistency estimates of reliability.
Finally, it permitted an evaluation of the question whether a change from 
Formula-scoring to Rights-scoring might not entail so extensive a change 
in test-taking behavior as to affect the parallelism of the test forms and, 
consequently, the general applicability of any equating that would be under- 
taken to make comparable the scores earned under the two types of directions. 

Before considering the results of the reliability studies, it should be 
useful to review certain points that affect their interpretation. First, 
in interpreting reliability coefficients obtained under Formula directions, 
the possibility that these coefficients overestimate the reliability of the 
scores because of the presence of noncognitive variance should be considered.
This noncognitive variance arises, according to the Differential Effect 
Hypothesis, because examinees differ with respect to their willingness to 
answer questions about which they have useful partial knowledge. On 






logical grounds, this effect in inflating reliability would be
accentuated, rather than diminished, if Rights rather than Formula scoring
were used with Formula directions. Second, in interpreting the reliability
coefficients for Rights directions, the possibility exists that an
analogous effect may arise when, as in this study, there are a
substantial number of unanswered items. Here, again, these unanswered items
may introduce noncognitive variance into the scores that raises the
reliability coefficients artificially. The presence of unanswered items
introduces another complication. Suppose that the examinees had followed
the instructions to answer every item, and in doing so had engaged in
blind guessing. There is reason to believe that the reliability coefficients
obtained under such conditions would be lower than those actually found.

If, as seems likely, the effects of these complications on the 
reliability coefficients are small, the empirical results of this study 
should be regarded as providing useful but not conclusive evidence on 
the relative reliability of Rights and Formula scores. 

Table 13 provides information on these questions. The first section
of the table gives observed-score correlations between Part 1 (45 items)
and Part 2 (40 items) under all possible combinations of scoring methods.
In spite of the different lengths of the parts of the test, these
correlations all represent parallel-forms reliability coefficients and are all
comparable insofar as length is concerned, since they all represent the
correlation between the 45-item Part 1 and the 40-item Part 2. The
italicized figures are the correlations between the results of scoring procedures
that were appropriate to the directions actually given in the administrations.



Table 13

Observed-Score Correlations, Reliability Coefficients,
and True-Score Correlations between Part-Scores


Observed-Score Correlations (a)

                No. of       Directions
Form   Group    Cases     Part 1    Part 2     r_{R_1 R_2}   r_{R_1 F_2}   r_{F_1 R_2}   r_{F_1 F_2}

 A       1      1026      Rights    Rights        .822          .813          .816          .818
 A       2      1068      Rights    Formula       .788          .798          .785          .804
 A       3      1054      Formula   Rights        .806          .801          .810          .815
 A       4      1038      Formula   Formula       .819          .809          .811          .824

 B       5      1040      Rights    Rights        .820          .809          .823          .820
 B       6      1034      Formula   Formula       .826          .815          .829          .837


Reliability Coefficients

                No. of       Directions
Form   Group    Cases     Part 1    Part 2     r_{R_1 R_1}(b)  r_{F_1 F_1}(c)  r_{R_2 R_2}(b)  r_{F_2 F_2}(c)

 A       1      1026      Rights    Rights        .848            .838            .828            .821
 A       2      1068      Rights    Formula       .8[?]8          .830            .809            .809
 A       3      1054      Formula   Rights        .845            .840            .822            .812
 A       4      1038      Formula   Formula       .831            .825            .803            .802

 B       5      1040      Rights    Rights        .849            .837            .844            [illegible]
 B       6      1034      Formula   Formula       .846            .842            .842            .844


True-Score Correlations (a)

                No. of       Directions
Form   Group    Cases     Part 1    Part 2     r_{R_1 R_2}   r_{R_1 F_2}   r_{F_1 R_2}   r_{F_1 F_2}

 A       1      1026      Rights    Rights        .982          .975          .980          .986
 A       2      1068      Rights    Formula       .956          .970          .958          .982
 A       3      1054      Formula   Rights        .967          .967          .975          .987
 A       4      1038      Formula   Formula      1.003          .991          .997         1.013

 B       5      1040      Rights    Rights        .968          .957          .979          .977
 B       6      1034      Formula   Formula       .978          .965          .984          .992


(a) Correlations between variables for which the scoring is consistent with the directions appear in italics.
(b) Kuder-Richardson Formula (20) reliability.
(c) Dressel (1940) adaptation of KR (20) reliability.




In general, the correlations are very similar, ranging in Form A from
.785 to .824. The two highest correlations in this section of the table
are those between the two parts of the test when administered and scored
in the same way (r_{R_1 R_2} = .822 and r_{F_1 F_2} = .824). The difference
between these two correlations is very small and therefore perhaps of little
consequence, but it does suggest the possibility that Formula directions
and scoring may be slightly more reliable than Rights directions and scoring.
This conclusion is supported by the observation that the correlations between
Formula scores on the two parts (r_{F_1 F_2}) are higher, on the average, than
any of the other three types of correlations (r_{R_1 R_2}, r_{R_1 F_2}, or
r_{F_1 R_2}), even though the method of scoring in some of these instances is
inconsistent with the directions. Finally, it is observed that the
correlations r_{R_1 R_2} (.822) and r_{F_1 F_2} (.824), which are the
correlations between two tests administered and scored in the same way, are
higher than the other correlations along the main diagonal, r_{R_1 F_2} (.798)
and r_{F_1 R_2} (.810), which are the correlations between tests that were
administered and scored in different ways.

The correlations between the parts of Form B also represent a narrow
range, from .809 to .837. As in the corresponding first section of the
table for Form A, the highest correlation between the parts of the test is
that when administered and scored by Formula (r_{F_1 F_2} = .837), higher than
the value, r_{R_1 R_2} = .820, obtained when Rights directions and scoring were
used.






It may also be observed in the data for Form A that the correlations
between Parts 1 and 2 are higher when the directions for administration
of the two parts are the same (Groups 1 and 4) than when the directions are
different (Groups 2 and 3); also that the correlations are higher when the
scoring methods are the same (columns 1 and 4) than when the scoring methods
are different (columns 2 and 3). However, the differences are not great,
indicating that differences in rank ordering of examinees on two tests
administered and/or scored in different ways are not much different from
the differences in rank ordering when the two tests are administered and
scored in the same way. The data for Form B confirm the finding that, on
the average, the correlations are slightly higher when the scoring methods
for the two parts are the same (columns 1 and 4) than when the scoring
methods are different (columns 2 and 3).

The second section of the table for Form A gives KR (20) reliabilities
for each of the two parts of the test, for each of the scoring methods, and
for each of the four groups of examinees (and modes of administration).
As expected, the reliabilities for Part 1 (r_{R_1 R_1} and r_{F_1 F_1}) are
higher than the reliabilities for Part 2 (r_{R_2 R_2} and r_{F_2 F_2}), since
they are based on a slightly longer test section. It is also seen that the
KR (20) reliabilities for Rights scores on Part 1 (column 1) are higher than
the corresponding reliabilities for Formula scores (column 2), whether the
method of scoring follows the directions for administration or not. This
is true for Part 2 also, but not as clearly (see columns 3 and 4).



*The reliabilities for Formula scores were calculated using Dressel's 
(1940) adaptation of KR (20). 
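
For readers unfamiliar with the internal-consistency coefficient used for the Rights scores, the following is a minimal sketch of the standard Kuder-Richardson Formula 20 for items scored 0 or 1. It is not the authors' program: the function name, the tiny data matrix, and the use of the population (divide-by-N) variance are assumptions of this sketch, and Dressel's adaptation for Formula scores is not shown.

```python
from typing import Sequence


def kr20(item_scores: Sequence[Sequence[int]]) -> float:
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    Rows are examinees, columns are items. The total-score variance is
    computed with N in the denominator (an assumption of this sketch).
    """
    n_examinees = len(item_scores)
    n_items = len(item_scores[0])

    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_examinees
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_examinees
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical responses of four examinees to three items (1 = right).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
reliability = kr20(scores)   # (3/2) * (1 - 0.625 / 1.25)
```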




The results for Form B are generally consistent with the results
for Form A. The KR (20) reliability for Rights is slightly higher
than for Formula for Part 1 (.849 vs. .842); for Part 2, the
reliabilities are equal (.844).

It should be observed that the KR (20) results may overestimate to
some extent the reliabilities of the Formula-directed tests because of the
slightly greater speededness characteristic of such administrations. (For
example, the index of speededness is, almost without exception in
these data, higher for Formula-directed administrations than for
Rights-directed administrations.) In effect, the reliability of Rights-directed
tests, as observed in these data, may actually be undervalued, and the
difference in KR (20) reliabilities in favor of Rights-directed tests may
well be greater than it appears here.

In general, then, it appears that the two methods of conceptualizing
reliability, the correlations between Parts 1 and 2 (parallel-forms) and
the internal-consistency coefficients (KR (20)), yield contradictory
results on this point. However, the differences are quite small, and
taken together, the overall results show clearly that the two modes of
administration and scoring yield very nearly equal reliabilities. On
this basis, it is reasonable to conclude that considerations of test
reliability should not be decisive in choosing one method of administration
and scoring over the other.

The third section of the table presents evidence on the parallelism of
the two conditions of testing and scoring, best indicated by the true-score






correlations, .970 and .975, in the second and third columns of the table
for Form A. Although these correlations, along with the others in the
second and third columns, are generally lower than those in the first and
fourth columns (in particular, the values of .981 and 1.013, essentially
1.00), they are sufficiently close to 1.00 to dispel any concerns that
they may be measuring substantially different abilities.
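
The true-score correlations in Table 13 appear to follow the standard correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below is offered under that assumption; the function name is hypothetical, and the three input values are the Group 4 entries for Formula scores as read from Table 13.

```python
import math


def true_score_correlation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(r_xx * r_yy)


# Group 4, Formula scores on both parts (values as read from Table 13):
# observed correlation .824, reliabilities .825 (Part 1) and .802 (Part 2).
estimate = true_score_correlation(0.824, 0.825, 0.802)
rounded = round(estimate, 3)
```

Because the reliabilities are themselves estimates, the corrected value can exceed 1.00 by a small amount through sampling error, which is why a figure such as 1.013 is read as essentially 1.00.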

Effect of Directions on Equating: Method of Analysis

The principal objective of this study was to evaluate the precision
and statistical bias, if any, of equating results obtained by assuming
that students perform equally well under Rights directions and Formula
directions when Formula scoring is used. Essentially, the analysis methods
call for comparing the results obtained by equating Rights scores to
Formula scores using standard and "ideal" equating methods with results
obtained by equating Rights scores to Formula scores by assuming the
Invariance Hypothesis.

In preparation for a description of these methods it will be useful to
describe again the method of administering the tests. As set forth earlier
in this report, Form A of SAT-verbal was administered to four spiralled
groups, each of which was given a different pattern of directions. Form B
of SAT-verbal was administered to two spiralled groups (which were also
spiralled with the four groups taking Form A), each of which was also given
a different pattern of directions. Data for Form B were used principally
for confirming the results of some of the analyses for Form A. The Chemistry
Test was also administered to two spiralled groups, each of which was also






given a different pattern of directions. The sample taking the Chemistry
Test was drawn from an entirely different population from the one from
which the SAT-verbal (Forms A and B) sample was drawn. Both forms of
SAT-verbal were given in two separately timed 30-minute parts. The Chemistry
Test was given in a single 60-minute session.

The directions used for each of the eight groups were as follows: 



              Directions for Administration

                        Form A

     Group          Part 1          Part 2

       1            Rights          Rights
       2            Rights          Formula
       3            Formula         Rights
       4            Formula         Formula

                        Form B

     Group          Part 1          Part 2

       5            Rights          Rights
       6            Formula         Formula

                       Chemistry

     Group          Directions

       7            Rights
       8            Formula



The experimental design permitted the use of five equating methods (but
with different subsets of the data), as follows:

1. Spiralling Method. As described earlier in this report, this
method calls for distributing the tests in sequence within each room in which
the test is administered. As a result of this process, the samples of students
taking each form will represent systematic samples of the total group tested.






According to probability theory, each subsample will tend to become
increasingly similar to the other subsamples as sample sizes increase.
Thus, for large samples it can be assumed that any two subsamples are
approximately equal in the abilities measured by the tests to be equated.
Scores on two tests are equated by setting equal the means and standard
deviations of the samples taking those two tests. The result of the
equating is that transformed scores on one test will have the same mean
and standard deviation as the observed scores on the other test. (For a
fuller discussion of this method, see Angoff (1971, pp. 569-571).)
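
Setting the means and standard deviations equal amounts to a linear conversion, y = A x + B, with slope A = s_y / s_x and intercept B = m_y - A m_x. The sketch below shows this arithmetic for two hypothetical samples; the function name and the score lists are illustrative only, and the population standard deviation is used as an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def mean_sigma_equating(x_scores: Sequence[float],
                        y_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) converting scores on X to the scale of Y.

    The conversion gives transformed X scores the same mean and standard
    deviation as the observed Y scores.
    """
    slope = pstdev(y_scores) / pstdev(x_scores)
    intercept = mean(y_scores) - slope * mean(x_scores)
    return slope, intercept


# Hypothetical scores for two spiralled samples taking tests X and Y.
x_sample = [10, 20, 30]
y_sample = [15, 35, 55]

slope, intercept = mean_sigma_equating(x_sample, y_sample)
converted = [slope * x + intercept for x in x_sample]
```

With spiralled samples, the two groups are assumed to be equivalent in ability, so any difference in mean and standard deviation is attributed to the tests rather than to the groups.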

2. Maximum Likelihood Method. This method calls for
administering each of the two tests to be equated to a random sample of students
and administering the same link, or anchor, test to all members of both
samples. The analytical procedure calls for the estimation of the mean
and variance of both tests for the total combined sample, and for setting
equal the estimated means and standard deviations for the two tests, as
is done in the Spiralling Method. The link test serves to increase the
precision of the equating results. This method is described fully by
Angoff (1971, pp. 576-579).

The following two methods make use of the Invariance Hypothesis: 

3. Invariant Link Method. This method makes use of the same
design as that used in the Maximum Likelihood Method. Each group takes one
of the two tests that are to be equated. In addition, both groups take the
same link test. However, here, one group takes the link test under Rights
directions and the other group takes the link test under Formula directions.






Equating is performed by rescoring by Formula the link test taken under
Rights directions and assuming that such scores can be treated as
interchangeable with Formula scores earned under Formula directions. The
analytical method used for treating these data is then identical to that
for Maximum Likelihood equating.

4. Invariance Method. In this method, the equating is based
on the results of a single test administered to a single group. A test
given under Rights directions and scored Rights is also scored by Formula.
It is then assumed that the Formula scores so obtained are equivalent to
the Formula scores that would have been obtained had Formula directions as
well as Formula scoring been employed for that group. The equating procedure
then calls for the direct equating of Rights scores to Formula scores for
the same individuals by setting equal their means and standard deviations
on the two types of scores.
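
A minimal sketch of the single-group procedure follows. It assumes the conventional formula score, F = R - W/(k - 1), with k = 5 options per item; both the value of k and the (Rights, Wrongs) pairs below are assumptions of this example rather than figures from the study, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def formula_score(rights: int, wrongs: int, k: int = 5) -> float:
    """Conventional formula score: Rights minus Wrongs / (k - 1)."""
    return rights - wrongs / (k - 1)


def invariance_equating(records: Sequence[Tuple[int, int]],
                        k: int = 5) -> Tuple[float, float]:
    """Equate Rights to Formula scores within one Rights-directed group.

    records holds (Rights, Wrongs) for each examinee. The same answer
    sheets are rescored by formula, and the two score types are then
    equated by setting their means and standard deviations equal.
    """
    rights: List[float] = [float(r) for r, _ in records]
    formula: List[float] = [formula_score(r, w, k) for r, w in records]
    slope = pstdev(formula) / pstdev(rights)
    intercept = mean(formula) - slope * mean(rights)
    return slope, intercept


# Hypothetical (Rights, Wrongs) counts for four examinees tested under
# Rights directions.
group = [(40, 30), (50, 25), (60, 20), (30, 35)]
slope, intercept = invariance_equating(group)
```

Only one sample is needed because both score types come from the same answer sheets; the method stands or falls on the assumption that these rescored Formula scores match what Formula directions would have produced.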

5. Identity Method. Although not a method of equating in the
usual sense, it is useful to consider as a criterion, or ideal, the results
of an "equating process" which yields a perfectly predictable result. This
is one in which a test is "equated" to itself, one which necessarily yields
a slope parameter of exactly 1 and an intercept parameter of exactly 0.
The advantage of considering this "method of equating" is that the study
design permits the equating of scores obtained using a particular type of
directions and a particular method of scoring to scores on the same test
using the same type of directions and the same method of scoring, but with
data based on two independent groups, and the opportunity to compare these




-73- 



results with the ideal criterion, represented by the Identity Method. 

All equating undertaken in the study, except for the auxiliary
equating to be described, involved the conversion of:

1) a part score to itself, with the tests for both groups
administered and scored by Rights, or with the tests for both groups
administered and scored by Formula;

2) a Rights score on a 30-minute part of the test to a
Formula score on the same 30-minute part; or

3) a Rights score on the full 60-minute Total test to a
Formula score on the full 60-minute Total test.

The following enumeration may be helpful in distinguishing among
the several types of equatings:

The first of the three types outlined above comprised 12
equatings, as follows:

a) Equating 1R(2) to 1R(1). (Read this: equating Rights
scores on Part 1, using data for Sample 2, to Rights scores on Part 1, using
data for Sample 1.) The description and the result of this type of equating
appear in Table 14, page 78, equations 2 and 3.

b) Equating 2R(3) to 2R(1). Table 14, equations 5 and 6.

c) Equating 1F(3) to 1F(4). Table 14, equations 8 and 9.

d) Equating 2F(2) to 2F(4). Table 14, equations 11 and 12.

These four sets of equatings were each done by the Spiralling and
Invariant Link methods and compared with those done by the Identity Method
(Table 14, equations 1, 4, 7, and 10).






The second of the three types comprised 18 equatings,
and is described as follows:

a) Equating 1R(1+2) to 1F(3+4) by the Spiralling and
Invariant Link methods. (Read this: equating Rights scores on Part 1,
using data from Sample 1 plus Sample 2, to Formula scores on Part 1, using
data from Sample 3 plus Sample 4.) These equatings are found in Table 14,
equations 13 and 14.

b) Equating 1R(1) to 1F(3) by the Spiralling and Maximum
Likelihood methods. Table 14, equations 16 and 17.

c) Equating 1R(2) to 1F(4) by the Spiralling and Maximum
Likelihood methods. Table 14, equations 19 and 20.

d) Equating 1R(1+2) to 1F(1+2) (equation 15), 1R(1) to
1F(1) (equation 18), and 1R(2) to 1F(2) (equation 21), by the Invariance
Method. Note that the Invariance Method did not require two samples, as
did the Identity, Spiralling, Invariant Link, and Maximum Likelihood methods,
but only the one sample, administered under Rights directions.

e) Corresponding to the foregoing 9 equatings were the 9
equatings of 2R to 2F (Table 14, equations 22-30), based on appropriate
groups.

The third type of equating outlined on page 73, is further 
described on page 76. 

Additional auxiliary procedures were brought into play for the
foregoing two types of equating in order to express all part-score equatings
as transformations of: a) raw Total Rights scores on Form A to the College
Board SAT-verbal scale; b) raw Total Rights scores on Form B to the College





Board SAT-verbal scale; and c) raw Total Rights scores on the Chemistry
Test to the College Board scale. This process may be described by
saying that when a part-score, x, on Form A was to be equated to another
part-score, y, on Form A, an attempt was made to express x in terms of
Total raw scores on Form A. This was accomplished by applying the
mean-and-sigma method to a single sample, rather than to two separate samples,
as is ordinarily done in equating one test to another. Specifically,
when it was necessary to equate Total Rights scores to Rights scores
on Part 1, the data from Sample 1 were used; in our adopted notation TR(1)
was equated to 1R(1). Similarly, Total Rights scores were equated to
part-scores 2R, 1F, and 2F, as needed, as follows: TR(1) to 2R(1), TR(1)
to 1F(4), and TR(1) to 2F(4). In like manner, an attempt was made to
express part-score, y, on Form A in terms of Total Formula scores on Form
A. This too was accomplished by applying the mean-and-sigma method to a
single sample, rather than to two separate samples, and was carried out,
as needed, as follows: 1R(4) to TF(4), 2R(4) to TF(4), 1F(4) to TF(4),
and 2F(4) to TF(4). Finally, Total Formula scores on Form A were
expressed in terms of the College Board scale by the use of conversion
parameters for Form A available in file.

The foregoing conversion steps may be summarized in the following
diagram:

   Total              Part              Part              Total              College
   Rights   --a-->    Score   --b-->    Score   --c-->    Formula   --d-->   Board
   Score              (x)               (y)               Score              Scale





The link of particular interest in this process is the link between one
part-score (x) and another (y), or, when analyzed in comparison with the
ideal, between a part-score and itself (Link b). The equating link (a)
is an auxiliary equating which permits the expression of part-score x in
terms of Total Rights score on Form A. The equating link (c) is a second
auxiliary equating which permits the expression of part-score y in terms
of Total Formula score on Form A. Link d is a set of conversion
parameters available in file, developed at the time that Form A was first
introduced as an operational SAT, and used to express raw Formula scores
on Form A in terms of College Board scaled scores. The application of the
succession of links diagrammed above makes it possible to span the links
and express all conversions undertaken in the equating of part-scores as
conversions of Total Rights scores on Form A to College Board scaled scores.
Further, any differences in scaled-score results within each set of three
equatings, involving a particular conversion in Link b, are attributable
to methodological differences in effecting that link.
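Spanning the links can be sketched as the composition of straight-line conversions, on the assumption that each link is linear (as the slope and intercept parameters reported in Table 14 suggest). The helper names and the numerical values below are hypothetical; Link b is set to the identity, which corresponds to equating a part-score to itself.

```python
def compose(first, second):
    """Compose two linear conversions given as (slope, intercept) pairs.
    Applying `first` and then `second` is again a linear conversion."""
    a1, b1 = first
    a2, b2 = second
    return a2 * a1, a2 * b1 + b2

def span_links(*links):
    """Chain any number of links, applied in the order given."""
    overall = (1.0, 0.0)  # identity conversion
    for link in links:
        overall = compose(overall, link)
    return overall

# Hypothetical parameters, for illustration only.
link_a = (0.5, -1.0)   # Total Rights score -> part-score x
link_b = (1.0, 0.0)    # part-score x -> part-score y (identity case)
link_c = (2.0, 3.0)    # part-score y -> Total Formula score
link_d = (8.0, 70.0)   # Total Formula score -> College Board scale

overall_slope, overall_intercept = span_links(link_a, link_b, link_c, link_d)
```

With links a, c, and d held fixed, any change in the overall slope and intercept comes from Link b alone, which is the comparison the text describes.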

The third type of equating outlined on page 73, involving the
conversion of Total Rights scores to Total Formula scores, comprised 9
equatings as follows:

a) Equating TR(1) to TF(4) for SAT-verbal, Form A, shown
in Table 15, page 85, equations 31 and 32.

b) Equating TR(5) to TF(6) for SAT-verbal, Form B, shown
in Table 15, equations 34 and 35.

c) Equating TR(7) to TF(8) for the Chemistry Test, shown
in Table 15, equations 37 and 38.

Each of the foregoing equatings was done by the Spiralling
and Invariant Link methods.

d) Equating TR(1) to TF(1), TR(5) to TF(5), and TR(7)
to TF(7) by the Invariance Method, shown in Table 15, equations 33, 36,
and 39. Note again that the Invariance Method did not require two samples,
as did the Identity, Spiralling, Invariant Link, and Maximum Likelihood
methods, but only the one sample, administered under Rights directions.

In all equating analyses, the examinees' Formula scores were rounded
to integers in accordance with the convention used in operational scoring,
which is to round upward all scores whose calculated values end in .5.
In determining Total scores on Forms A and B of the SAT-verbal, part-scores
were rounded to integers before adding them.
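The rounding convention can be made concrete with a short sketch. It assumes the usual Formula score, Rights minus Wrongs divided by one less than the number of answer choices, and it assumes five-choice items (consistent with the chance score of 17 on the 85-item SAT-verbal). The function names and the example counts are hypothetical.

```python
import math

def round_half_up(value):
    """Round to the nearest integer, with values ending in .5 rounded upward."""
    return math.floor(value + 0.5)

def formula_score(rights, wrongs, choices=5):
    """Rounded Formula score: Rights minus Wrongs / (choices - 1).
    The five-choice default is an assumption."""
    return round_half_up(rights - wrongs / (choices - 1))

# Hypothetical part-level counts for one examinee.
part1 = formula_score(rights=22, wrongs=18)   # 22 - 4.5 = 17.5 -> 18
part2 = formula_score(rights=18, wrongs=12)   # 18 - 3.0 = 15.0 -> 15

# Part scores are rounded first and then added, as described above.
total_formula = part1 + part2
```

Python's built-in `round` rounds halves to the nearest even integer, so it would give 32 rather than 33 for 32.5; that is why an explicit half-up helper is used here.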
Effect of Directions on Equating: Findings 

Table 14 describes the essential features of 30 equating sequences
relating Total Rights scores to Total Formula scores on Form A of SAT-verbal.
Results of each equating sequence are expressed on the College Board scale,
utilizing file parameters relating Form A Total Formula scores to the
standard score scale. Because the equating of Total Rights scores to
part-scores (Link a) and the equating of part-scores (Formula) to Total
Formula scores (Link c) were performed simply by setting the mean and
standard deviation of the designated Total score equal to the mean and
standard deviation of the designated part score, the equating method used
to establish these links is not specified in Table 14. For the equatings
relating part scores,




Table 14

Conversion Parameters Relating Total Rights Score on SAT-Form A to
College Board Scale, Determined on Various Data Bases and Various
Methods of Equating*

                                                      Scaled Score
Eq.   Part Scores   Equating       Parameters         When (X) is:
No.   Equated       Method       Slope    Intercept    17    40    85

 1    1R to 1R      Identity     8.1832   70.7859     210   398   766
 2    1R to 1R      Spiral       8.4574   60.9408     205   399   780
 3    1R to 1R      Inv.Link     8.1570   69.5821     208   396   763
 4    2R to 2R      Identity     8.1838   70.7856     210   398   766
 5    2R to 2R      Spiral       8.3065   66.0434     207   398   772
 6    2R to 2R      Inv.Link     8.2288   70.2317     210   399   770
 7    1F to 1F      Identity     8.1838   70.7851     210   398   766
 8    1F to 1F      Spiral       7.8057   81.4357     214   394   745
 9    1F to 1F      Inv.Link     8.0561   76.4712     213   399   761
10    2F to 2F      Identity     8.1835   70.7836     210   398   766
11    2F to 2F      Spiral       8.0419   75.9928     213   398   760
12    2F to 2F      Inv.Link     8.2350   69.9221     210   399   770
13    1R to 1F      Spiral       8.5269   59.9689     205   401   785
14    1R to 1F      Inv.Link     8.6454   54.7689     202   401   790
15    1R to 1F      Invariance   8.7341   51.9801     200   401   794
16    1R to 1F      Spiral       8.5792   59.6247     205   403   789
17    1R to 1F      Max.Lik.     8.6650   56.3627     204   403   793
18    1R to 1F      Invariance   8.7029   53.2754     201   401    --
19    1R to 1F      Spiral       8.4571   60.9399     205   399   780
20    1R to 1F      Max.Lik.     8.5545   57.4419     203   400   785
21    1R to 1F      Invariance   8.7654   50.6634     200    --   796
22    2R to 2F      Spiral       8.3190   65.6865     207   398   773
23    2R to 2F      Inv.Link     8.5182   60.1132     205   401   784
24    2R to 2F      Invariance   8.7572   53.5538     202   404   798
25    2R to 2F      Spiral       8.3277   65.4856     207   399   773
26    2R to 2F      Max.Lik.     8.5095   59.14[?]    204   400   782
27    2R to 2F      Invariance   8.8050   50.7599     200   403   799
28    2R to 2F      Spiral         --     66.0394      --    --   772
29    2R to 2F      Max.Lik.     8.5718   59.2453     205   402   788
30    2R to 2F      Invariance   8.7080   56.3567     204   405   797

*The notations, R and F, in this table refer to the method of scoring used
for equating the part, which does not necessarily correspond to the
directions under which the part was administered.

[An entry of -- or [?] is not legible in this copy. The columns giving the
sample for each score and the link test for each equating are likewise not
legible and are omitted.]




however, the table specifies the sample (if any) for each part score and
the equating method that was used. In equatings utilizing a link test,
both the part designations (1 or 2) and the scoring method (Rights or
Formula) are specified.

The first 12 equatings shown in Table 14 all involve equating of a
part score to itself. The equating method designated "Identity" shows the
results that would be obtained by perfect equating. The equating method
designated "Spiralling" shows the results obtained by two essentially
random samples that took the designated part. The equating method
designated "Invariant Link" uses data on the other separately-timed part of
SAT-verbal as an equating section, by assuming that Formula scores for
students tested using Rights directions are directly comparable to Formula
scores obtained by students tested under Formula directions. The 12 results
shown first in Table 14 were obtained by applying each of these methods
(Identity, Spiralling, and Invariant Link) to each of 4 part scores
(Part 1-Rights, Part 2-Rights, Part 1-Formula, and Part 2-Formula).

Considering first the results for scaled scores for the 12 equatings,
it is clear that there is quite close agreement for a Total Rights score
of 40. The range of scaled scores is from 394 to 399. Variation at the
chance score level (17 Rights) is from 205 to 214. At the perfect score
level (85), the scaled scores range from 745 to 780. Of these 12 equatings,
there is one, Equation 8, based on the Spiralling Method, that stands out
as clearly different from the rest. If that one equating is set aside, the
remaining 11 equatings show a variation of 205 to 213 at the chance score
level, 396 to 399 at score 40, and 760 to 780 at score 85.
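The scaled scores quoted above follow directly from the slope and intercept parameters in Table 14. The sketch below applies the conversion for Equation 1 (Identity) and Equation 8 (Spiralling), using the parameters as reported in the table; rounding the result to the nearest integer is an assumption, and the function name is hypothetical.

```python
import math

def scaled_score(raw, slope, intercept):
    """Linear conversion of a raw Total Rights score to the College Board
    scale, rounded to the nearest integer (rounding rule assumed)."""
    return math.floor(slope * raw + intercept + 0.5)

# (slope, intercept) pairs as reported in Table 14.
equations = {
    1: (8.1832, 70.7859),   # Identity
    8: (7.8057, 81.4357),   # Spiralling
}

raw_scores = (17, 40, 85)   # chance score, near the mean, perfect score
results = {
    eq: [scaled_score(x, slope, intercept) for x in raw_scores]
    for eq, (slope, intercept) in equations.items()
}
# results[1] is [210, 398, 766] and results[8] is [214, 394, 745],
# in agreement with the scaled scores tabled for these equations.
```

The lower slope of Equation 8 is what produces both its higher scaled score at the chance level and its lower scaled score at the perfect-score level.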

The variation in maximum scaled scores suggests that it would be
useful to compare the slope parameters obtained by the various methods.
Figure 1 shows the slopes classified according to the method used to
determine them. As would be expected, the 4 equatings based on the fact
that the part scores being equated are identical yield slopes that are
the same except for rounding error. Results for the Spiralling and for
the Invariant Link methods seem to agree reasonably well, on the average,
with results of the Identity Method, although the variation of results
from one sample to another is noticeably greater for Spiralling than for
the Invariant Link Method. This is to be expected on theoretical grounds,
in view of the greater sampling error of the Spiralling Method for a given
sample size. On the whole, these results may be considered to suggest
that the Invariant Link Method should be useful in making the transition
from Formula scoring to Rights scoring without disturbing the continuity
of the scale.

The remaining 18 equatings shown in Table 14 show relatively little
variation in scaled scores corresponding to a Rights score of 40. These
scaled scores range from 398 to 405. At the chance score level (17 Rights),
the range is from 200 to 207, and at the maximum score level (85), the range
is from 772 to 799. Again, it should be useful to consider the possibility
of systematic differences in slopes for the various methods. Figure 2
shows the slopes obtained. For these 18 equatings, the Invariance Method
yields slopes that are consistently high, and Spiralling



[Figure 1. Distribution of Slope Parameters for 12 Equatings Based on
Equating Each SAT-verbal Part Score to Itself. Plot of slope values
(vertical axis, 7.80 to 8.50) by method: Identity and Spiralling (methods
not assuming invariance) and Invariant Link (method assuming invariance).]





[Figure 2. Distribution of Slope Parameters for 18 Equatings Based on
Equating Rights and Formula Scores for the Same Part of SAT-verbal. Plot
of slope values (vertical axis, 8.20 to 8.90) by method: Spiralling and
Maximum Likelihood (methods not assuming invariance) and Invariant Link
and Invariance (methods assuming invariance).]



yields slopes that are relatively low. Maximum Likelihood and Invariant
Link methods agree reasonably well with each other but yield
somewhat higher slopes than those obtained by Spiralling. All six of
the slopes for the Invariance Method are relatively consistent, but
exceed the highest of the 12 slopes obtained by the other three methods.
This result raises some doubt about the practical usefulness of the
Invariance Method as a way of relating Rights scores to Formula scores
on the same test. The fact that the Invariance Method does not utilize
data on the Formula scores earned when the test to be equated is given
under Formula-scoring directions may contribute to the difference in
results for this method.

On the whole, the results for the slopes for the 30 equations developed
for Form A of SAT-verbal suggest that the Invariant Link Method is an
acceptable method for relating Rights and Formula scores, but raise questions
about the confidence with which the Invariance Method can be used for this
purpose. However, it should be noted that the Invariant Link Method may
be less satisfactory when applied, under operational conditions, to samples
drawn from students attending different test administrations than when
applied, as it was in this study, to essentially random equating samples.

The equating analysis of Total scores involved the same three basic
methods for all three tests: Forms A and B of SAT-verbal and the Chemistry
Test. First, Spiralling was used. Next, the Invariant Link Method was
used, with the separately-timed scores on Part 2 as the link test for
SAT-verbal and with a subscore based on 25 embedded items that has been
used in operational equating as the link test for Chemistry. Finally, Formula
scores were obtained for each group that had been tested under Rights
directions in order to provide data for the equating by the Invariance
Method.

Results for the equating of Total scores are presented in Table 15.
Considering first the scaled scores at the mean of the experimental groups,
it appears that, although the differences are not large, the methods
assuming invariance yield slightly higher scaled scores for raw scores
in these comparisons. The differences in scaled scores for maximum raw
scores are in the same direction and quite large. At the chance score
level, the methods assuming invariance yield somewhat lower scaled scores
than does the Spiralling Method. As it turned out, differences in results
for the three methods are relatively small for Form B of SAT-verbal.

When slope parameters are considered, the two methods assuming
invariance yield similar values for SAT-verbal, and these values are larger
than those for the Spiralling Method. For Chemistry, the Invariance Method
yields a noticeably steeper slope than does the Invariant Link Method, and
the Invariant Link Method yields a noticeably steeper slope than does the
Spiralling Method. The fact that the Invariant Link Method yielded a
slightly steeper slope than the Invariance Method for Form B of SAT-verbal
is inconsistent with the pattern observed for the equatings of part scores.
The equating studies show a tendency for equating methods that assume the
Invariance Hypothesis to overestimate slightly the slope parameters both



Table 15

Conversion Parameters Relating Total Rights Scores on Forms A and B of
SAT-verbal and Total Rights Scores on Chemistry to the College Board
Scale, as Determined by Various Methods of Equating

                        Group    Group
                        Tested   Tested                                               Scaled Score When
                        Using    Using                                                Raw (Rights) Score is:
Eq.                     Rights   Formula  Link      Equating     Parameters
No.  Test        Form   Dir.     Dir.     Test      Method       Slope    Intercept   Chance(a)  Mean(b)  Perfect(c)

31   SAT-verbal   A      1        4       --        Spiral       8.1832    70.7863      210        398      766
32   SAT-verbal   A      1        4       2F        Inv.Link     8.7381    53.1600      202        403      796
33   SAT-verbal   A      1        --      --        Invariance   8.7418    52.6899      201        402      796
34   SAT-verbal   B      5        6       --        Spiral       8.2558    62.7566      203        393      764
35   SAT-verbal   B      5        6       2F        Inv.Link     8.3961    58.7344      201        395      772
36   SAT-verbal   B      5        --      --        Invariance   8.3415    61.4369      203        395      770
37   Chemistry    --     7        8       --        Spiral       7.2568   213.3274      [?]        460      866(d)
38   Chemistry    --     7        8       25-item   Inv.Link     7.6023   205.2797      342        464      889(d)
                                          subscore
39   Chemistry    --     7        --      --        Invariance   7.9311   193.6605      336        463      907(d)

(a) Chance score is 17 for SAT-verbal; 18 for Chemistry.
(b) Mean score is 40 (rounded) for SAT-verbal; 34 (rounded) for Chemistry.
(c) Perfect score is 85 for SAT-verbal; 90 for Chemistry.
(d) Maximum reported score is 800.

[An entry of -- indicates that the column does not apply; [?] marks an
entry not legible in this copy.]



for SAT-verbal and Chemistry, as compared with standard equating methods.
There is a tendency for methods assuming the Invariance Hypothesis to
yield slightly lower scaled scores at the lower end of the score scale,
slightly higher scaled scores for the average student in the study sample,
and sufficiently higher scaled scores at the high end of the scale to
cause some concern, as compared with standard equating methods. On the
other hand, these differences are not alarmingly great. Although they
do not warrant a confident assertion that use of the Invariance Hypothesis
will permit equating without any danger of a scale discontinuity, they do
indicate that the discontinuity, if any, is likely to be within an
acceptable tolerance.
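The pattern just described (lower scaled scores at the low end, higher at the high end) is what two conversion lines with different slopes produce when they cross within the score range. As an illustrative sketch, the crossing point can be computed for SAT-verbal Form A from the Table 15 parameters for Equation 31 (Spiralling) and Equation 33 (Invariance); the function names are hypothetical.

```python
def crossing_point(line_1, line_2):
    """Raw score at which two linear conversions, each given as
    (slope, intercept), yield the same scaled score."""
    (a1, b1), (a2, b2) = line_1, line_2
    if a1 == a2:
        raise ValueError("parallel conversion lines never cross")
    return (b2 - b1) / (a1 - a2)

def difference(raw, line_1, line_2):
    """Unrounded scaled-score difference (line_2 minus line_1) at `raw`."""
    (a1, b1), (a2, b2) = line_1, line_2
    return (a2 * raw + b2) - (a1 * raw + b1)

# (slope, intercept) as reported in Table 15 for SAT-verbal, Form A.
spiral = (8.1832, 70.7863)       # Equation 31
invariance = (8.7418, 52.6899)   # Equation 33

x_cross = crossing_point(spiral, invariance)            # about 32.4
gap_at_chance = difference(17, spiral, invariance)      # negative
gap_at_perfect = difference(85, spiral, invariance)     # about +29
```

On these parameters the two lines cross at a raw score between 32 and 33, below the sample mean of about 40, so the Invariance line runs slightly higher at the mean and considerably higher at the perfect score.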






Summary and Conclusions

A brief review of the scope of the study and the characteristics
of the student samples is relevant to the generalizability of the
findings. With respect to scope, the tests administered and the
equating methods used need to be considered. Two forms of SAT-verbal
and one form of the College Board Chemistry Test were administered. Because
verbal abilities play a prominent role in the large-scale ETS admissions
tests, the results for SAT-verbal should be applicable to other verbal
tests. The Chemistry Test results provide information on a subject-matter
achievement test. The standard equating methods used in the study
included the Spiralling Method, the Maximum Likelihood Method, and a method
based on equating a test to itself. These methods provided a standard
against which equating methods that depended on the use of the Invariance
Hypothesis could be compared. With respect to student samples, the study
was based on data for high school students who participated in the testing
on a voluntary basis. It seems safe to assume that most of these students
were reasonably sophisticated about strategies for taking Formula-scored
and Rights-scored tests. The sample size for SAT-verbal was over 6,000,
and there were more than 1,000 students in each of the six spiralled
samples. The sample size for Chemistry was over 2,300, and there were
more than 1,150 students in each of the two spiralled groups. These
sample sizes should be large enough to provide an adequate basis for
investigating the various questions to which the study was addressed.






The data of the study permitted comparisons of Formula and Rights
directions with respect to Number of Items Omitted, to Number of Items
Not Reached, and to an index of guessing determined by subtracting the
Number of Items Omitted from the Number of Wrong Answers. These
comparisons yielded differences between the two sets of directions that were
in the expected direction but fairly small in magnitude. When students
in the samples tested with Form A of SAT-verbal were stratified by ability,
results indicated that under the conditions of the study, high-ability
students appeared to respond more appropriately to differences in
directions than did lower-ability students. Under Rights directions,
lower-scoring students tended to omit more items than high-scoring students.
Under Formula directions, low-scoring students tended to omit fewer items
than higher-scoring students. These findings suggest that special efforts
should be made to ensure that students answer all items when tests are
administered operationally using Rights directions.
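The nonresponse counts and the guessing index can be illustrated with a small sketch. The guessing index (Wrongs minus Omits) is as defined above. The split between Omitted and Not Reached is an assumption made for illustration: a blank followed by a later response is counted as Omitted, and blanks after the last response are counted as Not Reached. The function name and the response data are hypothetical.

```python
def response_counts(responses, key):
    """Count Rights, Wrongs, Omitted, and Not Reached items for one examinee.
    `responses` holds the chosen option for each item, or None for a blank.
    Assumed definitions: a blank before the last answered item is Omitted;
    blanks after the last answered item are Not Reached."""
    answered = [i for i, r in enumerate(responses) if r is not None]
    last_answered = answered[-1] if answered else -1
    omitted = sum(1 for r in responses[:last_answered + 1] if r is None)
    not_reached = len(responses) - (last_answered + 1)
    rights = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return {
        "rights": rights,
        "wrongs": wrongs,
        "omitted": omitted,
        "not_reached": not_reached,
        "not_attempted": omitted + not_reached,
        "guessing_index": wrongs - omitted,   # Wrongs minus Omits
    }

# Hypothetical six-item example.
key = ["A", "B", "C", "D", "E", "A"]
responses = ["A", None, "C", "B", None, None]
counts = response_counts(responses, key)
```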

The design of the experiment for Form A of SAT-verbal also provided
empirical data on two important, long-standing questions for which rigorous
data were not hitherto available. First, it was found that when the same
tests (Part 1 and Part 2 of SAT-verbal) were administered to one sample
under Rights directions and to another, closely comparable sample under
Formula directions, the (parallel-forms) correlation between the two parts
was higher under Formula directions. However, the difference was very
slight. When internal consistency (KR-20) reliability coefficients
were calculated for the tests administered and scored under the two modes,
the Rights-administered-and-scored tests were found to be more reliable
than the Formula-administered-and-scored tests, in contradiction to the
findings for the parallel-forms coefficients. But, again, the difference
was very slight, indicating that, in general, the difference in reliability
for the two modes of administration and scoring is nonexistent or, at
most, of little consequence. Second, it was found that administering
Parts 1 and 2 of SAT-verbal with the same directions yielded slightly
higher correlations than were obtained if the two parts were administered
with different directions. When coefficients were corrected for
reliability of measurement, the corrected coefficients were judged to be
sufficiently close to 1.00 to justify the assumption of parallelism for
purposes of equating.
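The two computations mentioned in this paragraph can be sketched as follows, under two assumptions: that the internal-consistency coefficient is the Kuder-Richardson formula 20 for items scored 0 or 1, and that the correction for reliability is the standard correction for attenuation (the observed correlation divided by the square root of the product of the two reliabilities). The function names and all data are hypothetical.

```python
from math import sqrt
from statistics import pvariance

def kr20(item_scores):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores
    (rows are examinees, columns are items)."""
    n_people = len(item_scores)
    n_items = len(item_scores[0])
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_people
        sum_pq += p * (1 - p)
    total_variance = pvariance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

def corrected_for_attenuation(r_xy, r_xx, r_yy):
    """Correlation between two parts corrected for the unreliability of each."""
    return r_xy / sqrt(r_xx * r_yy)

# Hypothetical 0/1 item scores for four examinees on three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(scores)

# Hypothetical correlation and reliabilities for Parts 1 and 2.
corrected = corrected_for_attenuation(r_xy=0.72, r_xx=0.81, r_yy=0.64)
```

A corrected coefficient near 1.00, as in this constructed example, is the condition the text treats as justifying the assumption of parallelism.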

Results for the effect of differences in directions on mean scores
when Formula scoring is used provide some support both for the Differential
Effect Hypothesis and for the Invariance Hypothesis. The observed
differences for SAT-verbal show slightly higher mean Formula scores (about
[number illegible] scaled score points) when Rights directions are used
than when Formula directions are used, as predicted by the Differential
Effect Hypothesis. On the other hand, the differences are, in general, too
small to be statistically significant; and this is just as would be
predicted by the Invariance Hypothesis. Moreover, results for Chemistry
means are consistent with the Invariance Hypothesis.

The equating studies show that methods that make use of the Invariance
Hypothesis, as compared with standard equating methods, might result in a
small overestimate of the slope parameters both for SAT-verbal and
Chemistry and a fairly large overestimate of scaled scores in the upper
portion of the score scale. Thus, although these results do not provide
a definitive basis for recommending that the Invariance Hypothesis will
permit equating without danger of a scale discontinuity, they do indicate
that mean scores should remain reasonably comparable if methods of equating
based on the Invariance Hypothesis are used during the period of transition.






A STUDY OF THE GRADUATE MANAGEMENT
ADMISSION TEST, BASED ON PROGRAM DATA

Questions Addressed by the Study

The principal purpose for which this study was undertaken was to
investigate the effectiveness of several methods of equating scores
that had been earned under conditions of Rights directions and scoring
to scores earned under conditions of Formula directions and scoring.

In the course of studying the equating methods it was deemed
necessary to investigate other related questions:

1. To what extent do the results provide a firm basis
for choosing between the Invariance Hypothesis and the Differential
Effect Hypothesis?

The Invariance Hypothesis and the Differential Effect
Hypothesis differ essentially in their predictions regarding how well
students would perform if, instead of choosing to omit certain items
when tested under Formula directions, they chose to answer them. The
Invariance Hypothesis implies that their performance on the omitted
items would be, on the average, neither better nor worse than would be
expected by chance. The Differential Effect Hypothesis, on the other
hand, implies that their performance on those items would be better,
on the average, than would be expected by chance. If the Invariance
Hypothesis is supported by the data, Formula scores would remain the
same, on the average, whether or not the students chose to omit items
about which they had insufficient basis for answering. If, however,
the Differential Effect Hypothesis is supported, students who choose
to omit certain items when tested under Formula directions would be at
a disadvantage in comparison with other students of equal ability who
answered all the items.

Although the same student cannot take the same test under 
both Rights and Formula directions at the same time, it is possible to 
administer the same test so that one random half of a large group is 
tested with Rights directions and the other half is tested with Formula 
directions. The Invariance Hypothesis would predict that the two groups 
would have virtually equal mean Formula scores; the Differential Effect 
Hypothesis would predict that the group tested under Rights directions 
would have a higher mean Formula score than the group tested under 
Formula directions.

2. To what extent do Formula directions affect the number of 
items Omitted, number of items Not Reached, and total number of items 
Not Attempted? 

3. When students are stratified on the basis of ability, is
there a discernible difference between high and low ability students in
the effect of Formula and Rights directions on the average number of
items Omitted, Not Reached, or Not Attempted? Do guessing indices
defined as "Wrongs minus Omits" or by a formula devised by Ziller provide
useful information about guessing tendencies that is not provided by the
various indices of nonresponse?




4. To what extent do Formula and Rights directions yield
different reliabilities?

5. Is there reason to believe that the assumption of
parallelism between a test administered with Rights directions and the same
test administered with Formula directions is not warranted?

6. How much confidence can be placed in the Invariance
Hypothesis as a basis for equating Rights scores to Formula scores? To
what extent does the use of the Invariance Hypothesis result in systematic
differences between conversion lines obtained by assuming invariance and
corresponding parameters obtained by traditional equating methods?
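The contrast drawn under Question 1 can be expressed as an expected-value calculation. The sketch below assumes the usual Formula score (Rights minus Wrongs divided by one less than the number of choices) and five-choice items; the function name and the probabilities are hypothetical. Under the Invariance Hypothesis, the probability of a correct answer on an otherwise omitted item equals the chance level; under the Differential Effect Hypothesis, it exceeds the chance level.

```python
from fractions import Fraction

def expected_formula_gain(items_answered, p_correct, choices=5):
    """Expected change in Formula score when an examinee answers
    `items_answered` items that would otherwise have been omitted,
    each with probability `p_correct` of being right."""
    p = Fraction(p_correct)
    expected_rights = items_answered * p
    expected_wrongs = items_answered * (1 - p)
    return expected_rights - expected_wrongs / (choices - 1)

# Invariance Hypothesis: success on omitted items is at chance (1 in 5).
gain_invariance = expected_formula_gain(10, Fraction(1, 5))

# Differential Effect Hypothesis: success is better than chance
# (3 in 10 is an arbitrary illustrative value).
gain_differential = expected_formula_gain(10, Fraction(3, 10))
```

The first case yields an expected gain of exactly zero, so the mean Formula score is unaffected by omitting; the second yields a positive expected gain, which is the disadvantage to omitters that the Differential Effect Hypothesis predicts.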






Study Design 

In this experiment the variable section of an operational form
of the Graduate Management Admission Test was used to study the effects
of Rights and Formula directions. Utilizing these data, taken from
the regular administration of the GMAT, offered several advantages over
special administrations. In particular, the very large sample size made
it possible to study five different item types under the two conditions
of administration and to have large samples for each of the ten
combinations. With this arrangement, it was possible to use data for the
operational part corresponding to each item type in the equating analyses.
In addition, conducting the study as part of the GMAT Program provided
realistic conditions of motivation and ensured that the sample was
reasonably typical of GMAT examinees. On the other hand, conducting the
study as part of the program imposed certain restrictions on the study
design. In particular, it was considered essential that the experiment
be conducted in such a way that it would have no effect on any examinee's
score of record, and with the additional condition that no examinee could
infer that the test material was experimental and would not count toward
his or her score.

The examinees participating in this experiment undoubtedly
anticipated, correctly, that they would be given a Formula-directed test, and
therefore it is entirely likely that some of them did not heed the Rights
score directions as closely as they should have. If this was indeed the
case, then it would follow that the experimental results have underestimated
somewhat the effect of Rights directions on the number of items attempted
as compared with what would happen if Rights scoring were used in
operational testing. Nevertheless, it seems reasonable that the effects
observed in the experiment provide an adequate basis for assessing the
impact of different directions on test scores.

The experimental tests were all given as the last of the eight
separately-timed sections at the October 1980 administration of GMAT.
Each of five item types was administered as a 30-minute separately-timed
part, and each of the resulting five experimental tests was administered
with Rights directions to one group of students and with Formula directions
to a different group of students. (Thus, there were ten different
experimental tests.) In order to make the groups taking the different forms
comparable in ability level, the tests were spiralled. The order in
which the ten tests were packaged and distributed was as follows:



Group   Test                          Directions

  1     Reading Comprehension         Formula
  2     Reading Comprehension         Rights
  3     Problem Solving               Formula
  4     Problem Solving               Rights
  5     Practical Business Judgment   Formula
  6     Practical Business Judgment   Rights
  7     Data Sufficiency              Formula
  8     Data Sufficiency              Rights
  9     Sentence Correction           Formula
 10     Sentence Correction           Rights
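Spiralling can be illustrated with a short sketch. It assumes that the ten test books are handed out in a repeating cycle, so that consecutive examinees receive different tests and the ten groups are of nearly equal size and comparable ability. The function name and the examinee identifiers are hypothetical, and the sketch is not a description of the actual packaging procedure.

```python
from collections import Counter
from itertools import cycle

# The ten experimental tests, in the packaging order listed above.
TESTS = [
    ("Reading Comprehension", "Formula"),
    ("Reading Comprehension", "Rights"),
    ("Problem Solving", "Formula"),
    ("Problem Solving", "Rights"),
    ("Practical Business Judgment", "Formula"),
    ("Practical Business Judgment", "Rights"),
    ("Data Sufficiency", "Formula"),
    ("Data Sufficiency", "Rights"),
    ("Sentence Correction", "Formula"),
    ("Sentence Correction", "Rights"),
]

def spiral(examinee_ids, tests=TESTS):
    """Assign tests to examinees in a repeating cycle (assumed procedure)."""
    return dict(zip(examinee_ids, cycle(tests)))

# Hypothetical seating order of 25 examinees.
examinees = [f"examinee_{i:02d}" for i in range(1, 26)]
assignment = spiral(examinees)
group_sizes = Counter(assignment.values())
```

Under this kind of rotation the group sizes can differ by at most one within a single ordered batch of examinees.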



As in the SAT-verbal phase of this study, the same tests were 
administered under both Rights and Formula directions, and under 
conditions that permitted the comparison of equivalent groups of examinees 
who took the tests under one or the other of the two directions. As was 
noted in the report of the SAT-verbal study, these are the only instances, 
to our knowledge, in which a study of Rights and Formula directions was 
designed to permit such comparisons. 

Major concerns in planning the test administration were to provide
appropriate Rights- and Formula-scoring directions for the experimental
tests, to ensure that examinees read the directions for those
tests, and to ensure that the tests were taken under normal test-taking
conditions.

The Supervisor's Manual for the test asked the supervisor to read
aloud the following statement just preceding the administration of the
experimental tests:

During the next 30 minutes you are to work only on
Section 8. Read the directions at the beginning of
Section 8 carefully and answer the questions in
accordance with these directions. Turn to Section 8,
read the directions and begin to work.

Because the directions were relatively brief, it was decided that examinees
could read the directions for their test within the time limits for the
test.

The following statement was printed at the beginning of tests
given with Formula directions:




Before answering the questions in this section, please
review the following directions, which are the standard
directions for GMAT.

Students often ask whether they should guess when they
are uncertain about the answer to a question. Your score
on this section will be based on the number of questions
you answer correctly minus a fraction of the number you
answer incorrectly. Therefore, it is improbable that
random or haphazard guessing will change your score
significantly. If you have some knowledge of a question,
you may be able to eliminate one or more of the answer
choices as wrong. It is generally to your advantage
to answer such questions even though you must guess which
of the remaining choices is correct. Remember, however,
not to spend too much time on any one question.

The following statement was printed at the beginning of tests given
with Rights directions:

Before answering the questions in this section, please 
read carefully the following directions, which apply 
only to this section. 

Your score on this test will be based on the number of 
questions you answer correctly. No deductions will be 
made for wrong answers. You are advised to use your 
time effectively and to mark the best answer you can 
to every question, regardless of how sure you are of 
the answer you mark. 

Although the Formula directions were somewhat longer, it was judged 
that this difference would be offset by the fact that the directions for 
these experimental tests were the same directions that had been given to 
the examinees in previous sections. 

Because all examinees who took the experimental tests took the same
form of GMAT, and because each of the experimental tests corresponded to
one of the parts of the operational test, it was possible to use the
operational test part scores in analyzing the experimental test results.

The operational test included the following separately timed parts:

                                          Number      Number
Part*   Content                          of Items   of Minutes

  1     Reading Comprehension               25          30
  2     Problem Solving                     30          40
  3     Practical Business Judgment         20          20
  4     Data Sufficiency                    30          30
  5     Practical Business Judgment         20          20
  7     Usage                               25          15

For purposes of analysis, scores on the two Practical Business Judgment
tests were usually combined, and the Usage test was considered to be
sufficiently similar to the Sentence Correction test to permit the use of
either as a link test for equating scores on the other test.



*Part 6 was given as a 30-minute separately timed part. It was composed
of pretest items and did not count in determining the examinee's score,
nor was it used in the present study.






Sample Characteristics 

The study sample was defined as all examinees for whom scores were
reported on the October test and who attempted at least one item on
the experimental test. The total sample size was 55,780. The ten
subsamples obtained by spiralling ranged in size from 5,408 for Group 5
and 5,409 for Group 10 to 5,739 for Group 1 and 5,738 for Group 6.
(The method of packaging and distributing the test books resulted in
progressively smaller sample sizes from Group 1 to Group 5 and from
Group 6 to Group 10.) Means of GMAT total scores for the ten groups were
quite similar, ranging from 473 (Group 6) to 477 (Groups 7 and 8).
The overall mean for the total sample was 475, with a standard deviation
of 106. By comparison, GMAT candidates tested from November 1975 through
July 1978 had a mean score of 461 with a standard deviation of 107.






Results 

Effect of Directions on Mean Formula Scores 

On logical grounds, the possible differential effect of directions
on Formula scores depends on whether examinees would do better than would
be expected by chance on items that they answer under Rights directions
but do not answer under Formula directions. Although it is anticipated
that the means of Rights scores would be clearly higher for those tested
under Rights directions than for those tested under Formula directions,
it would be expected, under the Invariance Hypothesis, that this
difference is caused by random responses to items that normally would not
be attempted under Formula directions. Under the Invariance Hypothesis,
the difference would be greatly reduced, if not caused to vanish
entirely, if a correction for guessing, as is normally applied in
Formula scoring, is used. Table 16 provides an opportunity to examine
the validity of this hypothesis.
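The logic of the Invariance Hypothesis can be illustrated with a small
calculation. The sketch below is not taken from the report: it assumes
five-option items (k = 5), uses the conventional Formula score
R - W/(k-1), and treats a hypothetical examinee who answers every
previously omitted item by a purely random guess. Under those assumptions
the expected Rights score rises while the expected Formula score is
unchanged.

```python
from fractions import Fraction


def formula_score(rights, wrongs, k=5):
    """Conventional Formula score R - W/(k-1) for k-option items (k is assumed)."""
    return rights - Fraction(wrongs, k - 1)


def expected_scores_after_random_guessing(rights, wrongs, omitted, k=5):
    """Expected Rights and Formula scores if every omitted item is answered
    by a purely random guess among the k options."""
    exp_rights = rights + Fraction(omitted, k)                # 1/k of the guesses are right
    exp_wrongs = wrongs + Fraction(omitted * (k - 1), k)      # the rest are wrong
    exp_formula = exp_rights - exp_wrongs / (k - 1)
    return exp_rights, exp_formula


# Hypothetical examinee: 15 right, 8 wrong, 6 omitted under Formula directions.
r, w, o = 15, 8, 6
before = formula_score(r, w)
exp_r, exp_f = expected_scores_after_random_guessing(r, w, o)

print("Formula score before guessing:", float(before))
print("Expected Rights score after guessing:", float(exp_r))
print("Expected Formula score after guessing:", float(exp_f))
```

If examinees instead did better than chance on the added responses, the
expected Formula score would rise as well, which is the alternative that
Table 16 is used to examine.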

As expected, Table 16 shows that, except for the Practical Business
Judgment part, the means of the Rights scores are higher for the groups
that received Rights directions than for the groups that received Formula
directions. This is not true, however, for the means of the Formula scores.
Formula-score means are remarkably similar for the groups receiving the two
types of directions on the tests of Reading Comprehension, Problem Solving,
and Usage, and reasonably similar on the other two tests. As it happened,
the mean Formula score is higher for those tested with Formula directions
than for those tested with Rights directions in four of the five
comparisons. This result is contrary to the hypothesis that examinees would
earn better than chance scores on items that they would not answer with
Formula directions. On the whole, the results support the hypothesis that
Formula scores are invariant with respect to directions.

Table 16

Descriptive Statistics on Rights and Exact (Unrounded) Formula Scores
for GMAT Experimental Tests

                                 Number                    Rights Score(a) Formula Score(a)
Experimental Test              of Items  Directions      N    Mean    S.D.    Mean    S.D.

Reading Comprehension                29      R        5658   15.03    6.21   12.35    7.11
Reading Comprehension                29      F        5739   14.68    6.46   12.38    7.17
Problem Solving                      25      R        5501    9.67    4.04    7.56    4.52
Problem Solving                      25      F        5594    9.00    4.02    7.57    4.43
Practical Business Judgment          32      R        5738   18.82    4.84   15.67    5.87
Practical Business Judgment          32      F        5408   18.95    4.87   15.85    5.89
Data Sufficiency                     40      R        5590   16.28    5.33   12.31    5.91
Data Sufficiency                     40      F        5657   16.18    5.46   12.42    5.32
Sentence Correction                  30      R        5409   17.05    5.24   14.44    6.07
Sentence Correction                  30      F        5486   16.79    5.40   14.43    6.03

(a) Means and standard deviations for which the scoring is consistent
with the directions appear in italics in the original.

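The comparison of Formula-score means described above can be tallied
directly. The following is only an illustrative sketch: the values are the
Formula-score means transcribed from Table 16, and the Data Sufficiency mean
under Formula directions is read as 12.42, which is an assumption because
that entry is unclear in the scanned table (a Formula-score mean cannot
exceed the corresponding Rights-score mean of 16.18).

```python
# Formula-score means from Table 16: (Rights directions, Formula directions).
# The Data Sufficiency value under Formula directions (12.42) is an assumed
# reading of an unclear entry in the scanned table.
formula_means = {
    "Reading Comprehension": (12.35, 12.38),
    "Problem Solving": (7.56, 7.57),
    "Practical Business Judgment": (15.67, 15.85),
    "Data Sufficiency": (12.31, 12.42),
    "Sentence Correction": (14.44, 14.43),
}

# Difference in mean Formula score: Formula directions minus Rights directions.
differences = {
    test: round(f_mean - r_mean, 2)
    for test, (r_mean, f_mean) in formula_means.items()
}

# Number of tests on which the Formula-directions group has the higher mean.
n_higher_under_formula = sum(
    1 for r_mean, f_mean in formula_means.values() if f_mean > r_mean
)

for test, diff in differences.items():
    print(f"{test:30s} {diff:+.2f}")
print("Higher under Formula directions:", n_higher_under_formula, "of", len(formula_means))
```

All five differences are small relative to the standard deviations reported
in the table, which is the sense in which the means are described as similar.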
Effect of Directions on Nonresponse 

As described earlier in this report, the entire group of candidates
tested in October 1980 was subdivided by spiralling into ten subgroups.
Five of these subgroups each took a section of experimental items
corresponding in type to one of the five operational sections (Reading
Comprehension, Problem Solving, Practical Business Judgment, Data
Sufficiency, and Sentence Correction, the last corresponding to the
operational Usage section) under Rights directions. The other five
groups also each took one of the aforementioned experimental test sections,
but under Formula directions. Score intervals were formed in terms of the
raw (Formula) scores on the operational test section, and means and standard
deviations were tabulated of the number of items Omitted, the number of items
Not Reached, and the number of items Not Attempted on the corresponding
experimental section for each stratum of students falling in those operational
score intervals, as well as for all the students in all strata combined.
Means and standard deviations were also tabulated for each of the ten groups
of students for the W-O index of guessing and for the Ziller index of
guessing, similarly stratified by score on the corresponding operational test
section. A summary of the more significant tabulations of these data, as
they relate to the effects of directions on the numbers of items attempted,
appears in Table 17. More detailed tabulations appear in Tables 18-27.

It is recalled that this report distinguishes between unanswered
items that precede the last item answered (designated "Omitted") and
items that follow the last item answered (designated "Not Reached"). In
general, it is reasonable to believe that examinees have considered
Omitted items and decided not to answer them. It is also reasonable to
believe that examinees have not answered items that are Not Reached
because they had too little time to consider them.
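The distinction between Omitted and Not Reached items can be expressed as a
simple counting rule. The sketch below is illustrative only: the function
name, the use of None to mark an unanswered item, and the handling of an
answer sheet with no responses at all are assumptions, not part of the
report (the study sample excluded examinees who attempted no item).

```python
def count_nonresponse(responses):
    """Count Omitted, Not Reached, and Not Attempted items.

    `responses` lists one entry per item in test order; None marks an
    unanswered item (an assumed encoding).
    """
    answered_positions = [i for i, r in enumerate(responses) if r is not None]
    if not answered_positions:
        # Assumed convention: with no answers, every item counts as Not Reached.
        omitted, not_reached = 0, len(responses)
    else:
        last = answered_positions[-1]
        # Unanswered items before the last answered item are Omitted.
        omitted = sum(1 for r in responses[:last] if r is None)
        # Items after the last answered item are Not Reached.
        not_reached = len(responses) - 1 - last
    return {
        "omitted": omitted,
        "not_reached": not_reached,
        "not_attempted": omitted + not_reached,
    }


# Hypothetical six-item answer sheet: item 2 skipped, items 5 and 6 never reached.
counts = count_nonresponse(["A", None, "C", "B", None, None])
print(counts)
```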

Table 17 shows the extent to which groups receiving Rights and
Formula directions differed with respect to items Omitted and items Not
Reached. Examinees had higher means both on items Omitted and items
Not Reached under Formula directions than under Rights directions, as would
be expected. The difference in items Omitted is relatively large for
Problem Solving. On the other hand, the difference for Practical Business
Judgment is trivial. Considering Table 17 as a whole, it appears that, on
the average, the effect of directions on the number of items Not Attempted
(items Omitted plus items Not Reached) is relatively small.

Tables 18-27 present the same data as in the summary table, Table 17,
but in much more detailed form, separately by interval of score on the
corresponding section of the operational test. In addition, as already
indicated, means and standard deviations of the W-O and Ziller indices are
given, similarly by score interval on the corresponding operational section.
Several observations may be made, in Tables 18-27, of the findings in



Table 17

Descriptive Statistics on Number of Items Omitted, Not Reached, and Not Attempted
for GMAT Experimental Tests

                                 Number                        Omitted      Not Reached    Not Attempted
Experimental Test              of Items  Directions      N    Mean    S.D.    Mean    S.D.    Mean    S.D.

Reading Comprehension                29      R        5658    1.32    2.67    1.94    3.94    3.26    4.89
Reading Comprehension                29      F        5739    2.44    3.30    2.66    4.35    5.09    5.35
Problem Solving                      25      R        5501    4.76    4.84    2.15    3.51    6.90    5.71
Problem Solving                      25      F        5594    7.41    4.36    2.90    3.84   10.31    4.41
Practical Business Judgment          32      R        5738    0.44    1.65    0.13    1.10    0.57    2.14
Practical Business Judgment          32      F        5408    0.50    1.72    0.14    1.10    0.64    2.22
Data Sufficiency                     40      R        5590    4.31    5.04    3.02    4.78    7.33    6.66
Data Sufficiency                     40      F        5657    5.62    5.43    3.18    4.89    8.80    6.69
Sentence Correction                  30      R        5409    1.03    2.43    1.47    3.06    2.50    3.92
Sentence Correction                  30      F        5486    1.73    2.98    2.05    3.54    3.78    4.49



Table 18

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Reading Comprehension, Given with Rights Directions,
Stratified by Formula Scores on the Operational Section of Reading Comprehension

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

22-25       189     0.29     0.82     0.25     0.89     0.54     1.42     5.14     3.05     0.96 [illeg.]
20-21       529     0.46     1.33     0.56     1.87     1.02     2.43     6.12     3.54     0.96     0.12
18-19       835     0.59     1.54     0.71     2.05     1.30     2.71 [illeg.]     4.19 [illeg.] [illeg.]
16-17       580     0.93     1.93 [illeg.] [illeg.] [illeg.] [illeg.] [illeg.] [illeg.]     0.93     0.14
14-15       879     1.09     2.16     1.48     3.18     2.57     3.96     9.49     5.15     0.93     0.14
12-13       640     1.61     2.73     2.10     3.90     3.71     4.72     9.75     6.03     0.90     0.16
10-11       650     1.78     3.00     2.40     4.17     4.19     5.01    10.92     6.33     0.90     0.16
 8- 9       514     1.75     2.87     3.15     4.91     4.90     5.50    11.84     6.41     0.91     0.15
 6- 7       339     2.71     4.13     3.38     4.84     6.09     6.09    10.67     8.17     0.86     0.20
 4- 5       257     2.07     3.44     4.22     5.62     6.28     6.42    12.69     7.68     0.89     0.18
 2- 3       124     2.21     3.48     6.79     7.02     9.00     7.80    12.02     9.05     0.87     0.20
 0- 1       122     2.95     5.08     4.69     6.04     7.64     7.49    12.66    10.19     0.87     0.21

Total      5658     1.32     2.67     1.94     3.94     3.26     4.89     9.39     6.12     0.92     0.15



Table 19

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Reading Comprehension, Given with Formula Directions,
Stratified by Formula Scores on the Operational Section of Reading Comprehension

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

22-25       186     0.70     1.44     0.72     2.20     1.42     2.92     4.05     3.02     0.90     0.19
20-21       586     0.95     1.82     0.65     1.90     1.60     2.72     5.18     3.46     0.90     0.17
18-19       879     1.23     2.01     0.99     2.44     2.22     3.25     6.11     4.06     0.89     0.16
16-17       568     1.90     2.50     1.86     3.30     3.77     4.15     6.04     4.72     0.85     0.18
14-15       845     2.10     2.75     2.16     3.62     4.26     4.53     7.20     5.21     0.85     0.18
12-13       638     2.93     3.04     3.08     4.27     6.01     4.61     6.92     5.28     0.82     0.17
10-11       681     3.16     3.59     3.21     4.46     6.37     5.15     7.75     6.34     0.82     0.19
 8- 9       519     3.44     3.83     4.21     5.00     7.65     5.55     7.90     7.17     0.80     0.20
 6- 7       325     4.28     4.41     5.30     5.52     9.58     5.61     6.70     7.30     0.77     0.21
 4- 5       253     4.06     4.29     5.16     5.75     9.22     6.23     8.44     7.85     0.80     0.20
 2- 3       142     4.63     4.53     6.20     6.23    10.83     6.16     7.68     7.68     0.78     0.19
 0- 1       117     4.03     5.21     6.34     7.26    10.37     7.45     9.15     9.38     0.80     0.23

Total      5739     2.44     3.30     2.66     4.35     5.09     5.35     6.79     5.73     0.84     0.19













Table 20

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Problem Solving, Given with Rights Directions,
Stratified by Formula Scores on the Operational Section of Problem Solving

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

27-30        11     2.18     2.25     0.45     1.44     2.64     2.44     0.45     3.29     0.59 [illeg.]
24-26        19     2.26     2.45     0.53     1.53     2.79     3.02     1.26     4.33     0.58     0.36
21-23        56     2.64     2.77     0.50     1.25     3.14     3.02     1.66     4.60     0.67     0.32
18-20       166     2.99     3.09     1.13     2.27     4.12     3.76     2.07     5.61     0.67     0.32
15-17       359     3.80     3.85     1.12     2.21     4.92     4.32     2.18 [illeg.]     0.65     0.33
12-14       713     4.14     4.10     1.57     2.78     5.71     4.68     2.82     7.44     0.67 [illeg.]
 9-11      1307     4.49     4.58     2.18     3.46     6.66     5.38     3.46     8.41     0.68     0.31
 6- 8      1506     5.24     5.17     2.51     3.84     7.75     5.92     3.53     9.37     0.67     0.31
 3- 5       960     5.53     5.31     2.57     3.79     8.10     6.33     4.53     9.94     0.68     0.29
 0- 2       404     5.12     5.43     2.42     3.99     7.54     6.57     6.75    10.17     0.73     0.27

Total      5501     4.76     4.84     2.15     3.51     6.90     5.71     3.67     8.86     0.68     0.30





Table 21

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Problem Solving, Given with Formula Directions,
Stratified by Formula Scores on the Operational Section of Problem Solving

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

27-30        11     2.64     2.01     0.45     0.89     3.09     2.23    -0.73     1.76     0.45     0.29
24-26        32     2.97     2.70     0.38     0.70     3.34     2.84    -0.34     4.01     0.52     0.28
21-23        70     3.70     2.91     0.53     1.08     4.23     2.91    -0.23     4.10     0.56     0.27
18-20       163     4.75     2.96     1.06     1.80     5.80     3.21    -1.20     4.28     0.49     0.27
15-17       357     5.89     3.10     1.64     2.60     7.53     3.04    -1.93     4.36     0.45     0.23
12-14       748     6.39     3.55     2.48     3.25     8.87     3.33    -1.87     5.20     0.47     0.24
 9-11      1346     7.09     3.97     3.14     3.85    10.22     3.73    -1.90     6.01     0.48     0.24
 6- 8      1518     8.06     4.40     3.33     4.13    11.39     4.13    -2.17     6.69     0.48     0.23
 3- 5       966     8.81     4.88     3.28     4.21    12.09     4.67    -1.89     7.70     0.50     0.24
 0- 2       383     8.12     5.09     2.98     4.02    11.10     5.43     1.04     8.87     0.58     0.24

Total      5594     7.41     4.36     2.90     3.84    10.31     4.41    -1.71     6.54     0.49     0.24



Table 22

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Practical Business Judgment, Given with Rights Directions,
Stratified by Formula Scores on the Operational Section of Practical Business Judgment

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

36-40        19     0.00     0.00     0.00     0.00     0.00     0.00     4.58     2.80     1.00     0.00
32-35       124     0.05     0.21     0.02     0.13     0.06     0.25     5.72     2.89     0.99     0.02
28-31       434     0.11     0.73     0.08     1.30     0.19     1.53     8.39     3.83     0.99     0.06
24-27       906     0.16     0.64     0.02     0.18     0.18     0.68    10.10     3.80     0.99     0.06
20-23      1234     0.20     0.94     0.04     0.40     0.24 [illeg.] [illeg.] [illeg.] [illeg.] [illeg.]
16-19      1321     0.27     0.92     0.05     0.52     0.32     1.13    12.99     3.68     0.99     0.05
12-15       869     0.64     1.75     0.13     0.82     0.77     2.04    13.84     4.44     0.97     0.09
 8-11       501     0.91     2.25     0.19     0.95     1.10 [illeg.]    14.78     5.33     0.96     0.11
 4- 7       242     1.66 [illeg.]     0.50     2.01     2.16     3.98    16.35     6.65     0.93     0.13
 0- 3        88     3.77     6.04     2.53     5.88     6.31     8.53    13.47    11.78     0.84     0.25

Total      5738     0.44     1.65     0.13     1.10     0.57     2.14    12.17     4.89     0.98     0.08



Table 23

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Practical Business Judgment, Given with Formula Directions,
Stratified by Formula Scores on the Operational Section of Practical Business Judgment

Score       No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval     Cases                                                            (W-O)            (Ziller)
                       Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

36-40     [illeg.]     0.12     0.32     0.00     0.00     0.12     0.32     4.65     2.03     0.98     0.04
32-35     [illeg.]     0.09     0.34     0.00     0.00     0.09     0.34     6.57     3.77     0.99     0.04
28-31     [illeg.]     0.10     0.60     0.01     0.08     0.11     0.60     8.23     3.48     0.99     0.05
24-27     [illeg.]     0.20 [illeg.]     0.07     0.82     0.27     1.45     9.83     3.90     0.98     0.07
20-23     [illeg.]     0.21     0.81     0.03     0.45     0.24     1.11    11.48     3.66     0.99     0.05
16-19     [illeg.]     0.36     1.13     0.07 [illeg.]     0.43     1.34    12.64     3.80     0.98     0.06
12-15     [illeg.]     0.59     1.77     0.10     0.63     0.69     1.93    13.69     4.34     0.97     0.08
 8-11     [illeg.]     1.13     2.39     0.21     0.94     1.34     2.69    14.87     5.18     0.95     0.11
 4- 7     [illeg.]     1.97     3.47     0.89     3.46     2.86     4.82    15.31     7.06     0.92     0.14
 0- 3     [illeg.]     3.72     5.63     1.73     4.07     5.44     7.52    13.71    10.58     0.85     0.22

Total         5408     0.50     1.72     0.14     1.10     0.64     2.22    11.91     4.84     0.97     0.08



Table 24

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Data Sufficiency, Given with Rights Directions,
Stratified by Formula Scores on the Operational Section of Data Sufficiency

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

27-30        19     1.68     2.64     0.53     1.39     2.21     3.09     8.05     5.30     0.88     0.17
24-26       143     1.91     2.80     1.69     3.25     3.59     4.48     8.74     5.32     0.87     0.17
21-23       261     2.31     3.07     2.41     3.95     4.71     4.91     9.90     5.83 [illeg.]     0.16
18-20       591     2.93     3.77     2.60     4.09     5.53     5.35    10.45     6.83     0.85     0.17
15-17       658     3.68     4.09     3.25     4.93     6.92     6.15    11.00     7.83     0.83     0.18
12-14       889     4.19     4.70     3.24     4.90     7.43     6.44    11.25     8.65     0.82     0.19
 9-11      1033     4.79     5.12     3.23     4.96     8.02     6.83    11.81     9.46     0.81     0.19
 6- 8       879     5.33     5.75     3.29     5.07     8.62     7.04    12.02    10.44     0.80     0.20
 3- 5       662     5.11     5.81     2.82     4.69     7.94     7.21    14.29    11.08     0.82     0.20
 0- 2       455     5.09     5.79     2.96     5.09     8.05     7.39    16.03    11.02     0.84     0.18

Total      5590     4.31     5.04     3.02     4.78     7.33     6.66    11.97     9.35     0.83     0.19



Table 25

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Data Sufficiency, Given with Formula Directions,
Stratified by Formula Scores on the Operational Section of Data Sufficiency

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

27-30        24     1.29     1.99     1.83     3.48     3.13     3.55 [illeg.] [illeg.]     0.91     0.13
24-26       154     1.64     2.61     1.42     2.99     3.06     3.87 [illeg.] [illeg.]     0.90     0.14
21-23       276     2.96     3.37     2.82     4.27     5.78     5.03 [illeg.] [illeg.]     0.84     0.18
18-20       575     2.94     3.33     2.80     4.21     5.74     5.24 [illeg.] [illeg.]     0.85     0.16
15-17       687     4.86     4.71     3.49     4.88     8.35     6.16 [illeg.] [illeg.]     0.78     0.19
12-14       846     5.48     4.87     3.64     5.17     9.13     6.24 [illeg.] [illeg.]     0.76     0.19
 9-11      1064     5.70     5.17     3.44     5.14     9.15     6.57 [illeg.] [illeg.]     0.77     0.19
 6- 8       816     7.02     5.69     3.32     5.05    10.34     6.65 [illeg.] [illeg.]     0.74     0.19
 3- 5       726     7.33     6.17     3.15     5.25    10.48     7.35     9.87    10.88     0.75     0.20
 0- 2       489     8.04     6.80     2.44     4.33    10.47     7.53    11.26    12.26     0.75     0.21

Total      5657     5.62     5.43     3.18     4.89     8.80     6.69     9.39     9.08     0.78     0.19

For the eight highest score intervals, only seven W-O means (7.00, 9.42,
10.03, 8.60, 8.51, 9.70, 8.95) and seven W-O standard deviations (3.20,
4.49, 5.93, 8.08, 8.32, 9.09, 9.81) are legible, in that printed order;
their assignment to rows cannot be recovered.



Table 26

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Sentence Correction, Given with Rights Directions,
Stratified by Formula Scores on the Operational Section of Usage

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

22-25       140     0.21     0.72     0.14     0.63     0.36     0.96     4.85     2.78     0.95     0.18
20-21       274     0.27     0.85     0.41     1.46     0.68     1.74     5.93     2.87     0.97     0.09
18-19       448     0.39     1.09     0.62     1.82     1.01     2.23     6.81     3.27     0.96     0.11
16-17       333     0.67     1.84     1.02     2.28     1.68     2.89     7.20     3.95     0.94     0.14
14-15       628     0.51     1.40     1.07     2.43     1.58     2.86     8.76     3.95     0.96     0.10
12-13       589     0.93     1.93     1.31     2.78     2.24     3.38     9.08     4.35     0.94     0.12
10-11       748     0.87     1.93     1.46     3.10     2.33     3.61     9.78     4.53     0.94     0.12
 8- 9       778     1.11     2.35     1.65     3.26     2.76     4.02    10.99     5.25     0.93     0.13
 6- 7       545     1.67     3.18     2.12     3.54     3.79     4.62    10.40     6.04     0.91     0.16
 4- 5       502     1.81     3.49     2.18     3.70     3.99     4.85    11.36     6.60     0.91     0.17
 2- 3       266     1.87     3.63     2.72     4.03     4.59     5.11    11.74     6.94     0.91     0.16
 0- 1       158     2.22     4.08     2.59     4.19     4.81     5.60    12.73     8.38     0.89     0.19

Total      5409     1.03     2.43     1.47     3.06     2.50     3.92     9.42     5.36     0.93     0.14



Table 27

Number of Items Omitted, Not Reached, Not Attempted, and Guessed
on the Experimental Section of Sentence Correction, Given with Formula Directions,
Stratified by Formula Scores on the Operational Section of Usage

Score    No. of     Omitted         Not Reached      Not Attempted     Guessing Index    Guessing Index
Interval  Cases                                                            (W-O)            (Ziller)
                    Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.

22-25       121     0.49     1.20     0.44     1.34     0.93     1.88     4.39     2.72     0.94     0.14
20-21       279     0.41     1.07     0.61     1.75     1.02     2.09     5.25     2.92     0.94     0.16
18-19       472     0.75     1.67     0.85     2.25     1.60     2.94     6.10     3.60     0.92     0.16
16-17       353     1.05     1.92     1.27     2.65     2.32     3.16     6.53     3.68     0.91     0.14
14-15       648     0.84     1.73     1.08     2.41     1.92     3.01     7.79     4.00     0.93     0.13
12-13       588     1.68     2.56     1.62     2.94     3.30     3.70     7.45     4.56     0.88     0.16
10-11       822     1.70     2.87     1.98     3.32     3.68     4.21     8.31     5.21     0.89     0.17
 8- 9       741     1.85     2.78     2.61     3.76     4.46     4.42     8.77     5.29     0.88     0.16
 6- 7       555     2.79     3.90     2.94     3.98     5.74     4.87     8.05     6.36     0.85     0.19
 4- 5       483     2.92     3.84     3.43     4.54     6.35     5.14     8.15     6.42     0.84     0.19
 2- 3       276     3.12     4.24     4.06     5.03     7.18     5.77     8.76     6.81     0.84     0.19
 0- 1       148     3.08     4.50     3.80     4.53     6.89     5.62    10.15     8.10     0.85     0.20

Total      5486     1.73     2.98     2.05     3.54     3.78     4.49     7.70     5.26     0.89     0.17



the columns of numbers of items Omitted. One, as expected, there is a
decline in the mean number of items Omitted as a function of score on the
corresponding operational test; the higher the score on the operational
test, the fewer the Omits on the corresponding experimental section.
Two, also as expected (and as pointed out above), there are more Omits
for those taking the test under Formula directions than for those taking
the test under Rights directions. This difference is clearly in evidence
on all of the five experimental tests except for Practical Business Judgment,
in which the difference, while still in the direction of more Omits under
Formula directions, is very small indeed. Three, the progressions of mean
Omits for the two types of directions track each other very closely in most
of these five tests, especially in the case of Practical Business Judgment,
where they follow almost precisely the same levels as well as the same
patterns. Four, in all five test sections the differences in the mean
number of items Omitted for the two types of directions become smaller
with increases in ability; the higher the score on the operational test, the
smaller is the difference in the mean Omits for candidates tested under
Formula and Rights directions. This last finding, it is noted, is in
conflict with the finding in the SAT phase of the study, in which it was
observed that differences in omitting behavior were more pronounced at
higher levels of ability than at lower levels of ability, not less pronounced.
Whether the difference between the two studies is a function of the age and
level of sophistication of the students is a matter for speculation.

The results shown by the tabulations of the number of items Not Reached
match directly those shown by the tabulations of the number of items
Omitted. One, the number of items Not Reached declines with score on
the corresponding operational section of the GMAT. This too is to be
expected, since the count of Not Reached (NR) is often taken as a measure
of speededness, and speededness is expected to correlate negatively with
ability. Two, the tests administered under Formula directions show
higher average NR counts than tests administered under Rights directions.
This too is expected; Formula-directed tests are likely to take more time
per item and to result in greater numbers of items Not Reached than
Rights-directed tests. In addition, because it is to the student's advantage
under Rights directions to answer every item, some examinees undoubtedly
used blind guessing or superficial considerations in answering items near
the end of the tests. Three, also as in the study of Omits, the progression
of mean NR counts for Formula-directed tests and the progression of NR counts
for Rights-directed tests track each other surprisingly closely, especially
so (again) for the test of Practical Business Judgment. Four, the
differences between mean NR counts for Formula-directed tests and mean NR
counts of Rights-directed tests decline as a function of increasing ability,
as measured by the score on the corresponding operational test section.

The tabulations of the mean number of items Not Attempted (NA) show
the same pattern as shown by the tabulations of the numbers of items Omitted
and Not Reached. This too is expected, inasmuch as the NA count is the
simple sum of the counts of Omitted and Not Reached items. As in the case
of its component counts, (a) the number of items Not Attempted declines as
the score on the corresponding operational test section rises; (b) the
NA count is clearly and consistently higher for Formula-directed tests
than for Rights-directed tests except in the case of Practical Business
Judgment, for which the two trends are very close and almost
indistinguishable; (c) the progression of the decline for the NA count in the
Rights-administered tests tracks very closely the progression of the decline
for the NA count in the Formula-administered tests, and as was just pointed
out, they are virtually indistinguishable in the case of the Practical
Business Judgment test; and (d) the difference in the NA count decreases with
ability, as measured by the operational section score.

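The statement that the NA count is the simple sum of its two components can
be checked against the summary means. The sketch below uses the means
transcribed from Table 17; the variable names are illustrative, and the
small tolerance is an assumption that allows for the rounding of the
published means to two decimals.

```python
# Means from Table 17: (Omitted, Not Reached, Not Attempted),
# keyed by (experimental test, directions).
TABLE_17 = {
    ("Reading Comprehension", "R"): (1.32, 1.94, 3.26),
    ("Reading Comprehension", "F"): (2.44, 2.66, 5.09),
    ("Problem Solving", "R"): (4.76, 2.15, 6.90),
    ("Problem Solving", "F"): (7.41, 2.90, 10.31),
    ("Practical Business Judgment", "R"): (0.44, 0.13, 0.57),
    ("Practical Business Judgment", "F"): (0.50, 0.14, 0.64),
    ("Data Sufficiency", "R"): (4.31, 3.02, 7.33),
    ("Data Sufficiency", "F"): (5.62, 3.18, 8.80),
    ("Sentence Correction", "R"): (1.03, 1.47, 2.50),
    ("Sentence Correction", "F"): (1.73, 2.05, 3.78),
}

# Largest discrepancy between NA and Omitted + Not Reached (rounding only).
max_dev = max(abs(o + nr - na) for o, nr, na in TABLE_17.values())

# Formula-minus-Rights difference in mean Not Attempted, per test.
tests = sorted({name for name, _ in TABLE_17})
na_gap = {
    t: round(TABLE_17[(t, "F")][2] - TABLE_17[(t, "R")][2], 2) for t in tests
}

print("Largest |Omitted + NR - NA|:", round(max_dev, 2))
for t in tests:
    print(f"{t:30s} {na_gap[t]:+.2f}")
```

The per-test differences make the contrast in point (b) concrete: the gap is
largest for Problem Solving and nearly zero for Practical Business Judgment.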
Effect of Directions on Guessing 

In the present phase of the study, two indices of guessing were
studied. One of these indices is the index W-O (the number of items
answered incorrectly minus the number of items omitted), examined in
the study of the SAT. The other is the index advanced by Ziller (1957):

$$Z = \frac{[k/(k-1)]\,W}{[k/(k-1)]\,W + NA},$$

where k = no. of response options per item,

W = no. of items answered incorrectly, and

NA = no. of items Not Attempted (Omitted plus Not Reached).
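The two indices defined above can be computed from an examinee's item
counts as follows. This is a minimal sketch: the function names, the
example counts, and the choice to return None when the Ziller denominator
is zero are assumptions made for illustration, not part of the report.

```python
def w_minus_o(wrong, omitted):
    """W-O index: items answered incorrectly minus items omitted."""
    return wrong - omitted


def ziller_index(wrong, not_attempted, k):
    """Ziller index: [k/(k-1)]W / ([k/(k-1)]W + NA) for k-option items.

    Returns None when the denominator is zero (no wrong answers and
    nothing left unattempted), where the ratio is undefined.
    """
    adjusted_wrong = k / (k - 1) * wrong
    denominator = adjusted_wrong + not_attempted
    if denominator == 0:
        return None
    return adjusted_wrong / denominator


# Hypothetical examinee on a five-option test:
# 8 wrong, 3 omitted, 2 not reached (so NA = 5).
wrong, omitted, not_reached, k = 8, 3, 2, 5
wo = w_minus_o(wrong, omitted)
z = ziller_index(wrong, omitted + not_reached, k)

print("W-O index:", wo)
print("Ziller index:", round(z, 3))
```

Note that W-O is a count that can be negative when Omits exceed Wrongs,
whereas the Ziller index is a proportion between 0 and 1.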

Both of these indices are offered for consideration because both appear
to benefit from a defensible rationale. At the same time, it should be
pointed out that both suffer from certain deficiencies. Both indices,
it is observed, are derived from the responses, and nonresponses, to
the items in the test. The justification for both indices is the
justification described in the report of the SAT phase of this study: that
the number of Wrongs includes all items for which it can safely be assumed
that the student had less than complete knowledge. Assuming that he
(she) made no clerical errors in responding, it is presumed that anyone
who responds without complete knowledge does so with at least some degree
of guesswork, except, perhaps, for those students who respond with
confidence but with incorrect knowledge or information.

As suggested earlier in this report, it is reasonable to believe
that there should be no correlational relationship between the tendency
to guess, when taken as a personality trait, and cognitive ability in the
abstract sense. On the other hand, there is good reason to believe that
the act of guessing does affect the test score and therefore should
correlate with it (L. R. Tucker, personal communication). The question
remains, to what extent should there be such a correlation, and should the
correlation be positive or negative? Related to this question might be the
question, should we be able to anticipate the nature of the correlation
from the nature of the index itself?

As described above, the rationale for the index W-O is that W includes
all items for which it may be assumed that the student had less than complete
knowledge, and it is presumed that, except for the instances of confident,
but incorrect, responses, the student who responds without complete
knowledge does so with some degree of guesswork. The subtraction of Omits
is introduced as evidence of a tendency not to guess. It might be argued
that there is a deficiency in the W-O index, arising principally from the
fact that it does not control for "opportunity." That is, leaving aside
the NR count (which is taken to be the number of items that the student
does not have time to consider, and therefore does not represent either
a decision to guess or a decision not to guess), the index W-O is largely
a function of score level. Thus, high-ability students will necessarily
earn a low W-O index of guessing, and this can be predicted from an
examination of the index itself; in effect, the student's level of ability
coerces the numerical value of the guessing index. On the other hand,
if we are searching for an index of guessing behavior on the test, as we
are, then the W-O index becomes far more attractive. High-ability students
have a low W-O value because they do not guess. For obvious reasons they
do not have to guess. The index, then, is admittedly not a measure of
their general propensity to guess; it is a measure of the amount of
guessing they actually do in the course of taking that test.

There is one adjustment that might have been introduced into the
W-O index, but was not. This is to increase the W component by the factor
k/(k-1) (where k = number of response options per item), to account for
the fact that some of the student's guessing actually resulted in correct
responses. Although the factor k/(k-1) may well constitute an
overcorrection (because many correct responses are the result of partial
information and therefore only partial guessing), it is also clear that
the unadjusted index, W-O, is a slight underestimate of the actual
guessing behavior. On the other hand, it is probably not enough of an
underestimate to change the results of this study in any significant way.
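
The two versions of the index discussed above can be illustrated with a
short sketch. The function names and the example counts (12 Wrongs,
4 Omits, five-choice items) are hypothetical and are not taken from the
study data.

```python
from fractions import Fraction


def guessing_index(wrongs: int, omits: int) -> int:
    """Unadjusted index of guessing: Wrongs minus Omits (W-O)."""
    return wrongs - omits


def adjusted_guessing_index(wrongs: int, omits: int, k: int) -> Fraction:
    """W-O with the W component weighted by k/(k-1).

    k is the number of response options per item. The weighting credits
    guesses that happened to produce correct responses.
    """
    return Fraction(k, k - 1) * wrongs - omits


# Hypothetical student: 12 wrong, 4 omitted, five-choice items.
print(guessing_index(12, 4))              # 8
print(adjusted_guessing_index(12, 4, 5))  # 11
```

In this example the unadjusted value (8) is lower than the adjusted value
(11), which is the direction of the underestimate described in the text.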

The Ziller index, it is noted, does contain the factor k/(k-1),
and expresses the index of guessing as a proportion, in which the
numerator represents the number of times the student did guess, and in
which the denominator represents an "ignorance" score, a score
representing the maximum number of items at which the student might have
guessed. Indeed, the denominator is algebraically identical to the total
number of items minus the conventional Formula score, R - W/(k-1). In
this sense, the index is properly characterized as an attempt to describe
a general tendency on the part of the student, transcending his (her)
actual behavior on the particular test. This being the case, one would
expect (and the data cited below confirm this expectation) that the
Ziller index will show a lower correlation with score level than the
W-O index.

There is no fundamental conflict between the two indices; they are
intended to express different types of measures. Nevertheless, it may
also be in order to suggest some possible alterations in the Ziller index,
as we did with the W-O index. First, it would be useful to include in
the numerator a subtraction for Omits, as is done in the W-O index,
because the omission of an item indicates a tendency not to guess. (Note
that this change would make the numerator in the Ziller index identical to
the W-O index, once the W in that index is weighted by the factor k/(k-1).)
Second, in consideration of the intent of the denominator, which is to
express the number of opportunities to guess, it would be preferable to
confine the nonresponses in the second term to Omits only, on the basis
that the NR items are those that the student has never had time to
consider. Thus, the suggested revision of the Ziller index might be the
following:

$$\frac{[k/(k-1)]\,W - O}{[k/(k-1)]\,W + O}$$
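
As a rough illustration, the Ziller index as characterized in the text and
the suggested revision can be computed as follows. This is a sketch under
stated assumptions: the original index is taken to have the numerator
[k/(k-1)]W and a denominator that adds all nonresponses (Omits plus Not
Reached) to that numerator, which is how the passage above describes it.
The function names and the example counts are hypothetical.

```python
def formula_score(rights: float, wrongs: float, k: int) -> float:
    """Conventional Formula score: R - W/(k-1)."""
    return rights - wrongs / (k - 1)


def ziller_index(wrongs: float, omits: float, not_reached: float, k: int) -> float:
    """Estimated guesses divided by estimated opportunities to guess."""
    guesses = k / (k - 1) * wrongs
    return guesses / (guesses + omits + not_reached)


def revised_ziller_index(wrongs: float, omits: float, k: int) -> float:
    """Suggested revision: subtract Omits above, count only Omits below."""
    guesses = k / (k - 1) * wrongs
    return (guesses - omits) / (guesses + omits)


# Hypothetical 40-item, five-choice test: R = 24, W = 12, O = 3, NR = 1.
n_items, rights, wrongs, omits, not_reached, k = 40, 24, 12, 3, 1, 5

denominator = k / (k - 1) * wrongs + omits + not_reached
# The denominator equals total items minus the Formula score.
print(denominator, n_items - formula_score(rights, wrongs, k))  # 19.0 19.0
print(round(ziller_index(wrongs, omits, not_reached, k), 3))    # 0.789
print(round(revised_ziller_index(wrongs, omits, k), 3))         # 0.667
```

Neither function guards against a zero denominator (a student with no
Wrongs and no nonresponses); a production version would need to decide
how to treat that case.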

It was expected in this phase of the study, as in the SAT phase of
the study, that the W-O index would correlate negatively with score level
because W, the dominant factor in the index, would certainly correlate
negatively with score level. It was not known, however, how the Ziller
index would correlate, and whether the correlation might yield any
insights regarding the characteristics of Rights and Formula directions.
Table 28 below is designed to offer some information in these respects.
The first section of this table gives correlations of Wrongs, Omits, W-O,
and Ziller, earned on an experimental section (when administered and scored
by Rights and when administered and scored by Formula) with the
corresponding Formula-scored operational section of the test. The second
section of the table gives the correlations of the same four variables with
the experimental test score from which the variables themselves were
derived. The correlations in this section of the table describe
relationships between variables that are based, in part, on the same data.
As expected, both the Wrongs scores and the Omits scores are without
exception negatively correlated with both the operational and the
experimental test sections. Also as expected, the W-O index is, with only
two exceptions, negatively correlated with the operational scores and with
the scores on the experimental



Table 28

Correlations of Wrongs, Omits, and Two Indices of Guessing on Experimental Tests
with Scores on Operational and Experimental Tests

                                        Correlations of Responses on Experimental Tests with:(a)

                                        Operational Test Score          Experimental Test Score
Experimental                   No. of
Test               Directions  Cases    Wrongs  Omits   W-O   Ziller    Wrongs  Omits   W-O   Ziller

Reading Comp.      Rights      5658     -.497   -.249  -.310   .175     -.644   -.352  -.388   .258
                   Formula     5739     -.451   -.332  -.164   .204     -.675   -.391  -.305   .212

Prob. Solving      Rights      5501     -.330   -.133  -.117  -.048     -.234   -.374   .069   .159
                   Formula     5594     -.347   -.257  -.026  -.064     -.531   -.290  -.109  -.162

Pract. Business    Rights      5738     -.562   -.259  -.432   .213     -.898   -.327  -.718   .243
Judgment           Formula     5408     -.568   -.284  -.422   .232     -.926   -.294  -.749   .220

Data Sufficiency   Rights      5590     -.433   -.179  -.172   .067     -.287   -.342   .006   .245
                   Formula     5657     -.393   -.301  -.051   .175     -.443   -.311  -.075   .175

Sentence Corr.(b)  Rights      5409     -.525   -.213  -.343   .135     -.687   -.361  -.412   .248
                   Formula     5486     -.453   -.272  -.192   .181     -.684   -.369  -.314   .226

(a) Scores on the operational tests are uniformly Formula scores. Scores on the
    experimental tests are consistent with the directions: Rights scores with
    Rights directions and Formula scores with Formula directions.

(b) The operational score corresponding to the experimental Sentence Correction
    section is Usage.



tests. The range of the correlations with the operational scores extends
from -.026 to -.432. The range of the correlations with the experimental
scores, leaving aside the Practical Business Judgment test for the moment,
extends from -.412 to +.069. The correlations of W-O with the
experimental Practical Business Judgment test are extremely high negative
in comparison with the others (-.718 and -.749), undoubtedly because the
means and standard deviations of Omits shown in Tables 22 and 23 are so
small that W-O becomes virtually equal to W. These correlations are indeed
not much lower than the correlations of the Wrongs with the experimental
test scores (-.898 and -.926). Referring again to the column of
correlations of W-O with the operational test scores, we see that the
correlations with Practical Business Judgment are again high negative.
Reference to the operational test analysis for that form confirms that
that operational test also had very few Omits, again suggesting that the
correlation is to a considerable extent the correlation between
complementary scores. Here too, the correlations of the W-O scores with
the operational test scores (-.432 and -.422) are not much lower than the
correlations of the Wrongs scores with the operational test scores (-.562
and -.568).

The correlations of the Ziller index with the operational test scores
and with the experimental test scores are generally positive and smaller,
in absolute size, than the corresponding correlations for the W-O index.
This is expected, as indicated earlier, because, unlike the W-O index,
which is a direct function of the student's behavior on the test, the
Ziller index, in expressing the "amount of guessing" as a proportion of
the "opportunity for guessing," attempts to achieve a measure of the more
general "tendency to guess."

Returning to Tables 18-27, it is noted that the same tabulations as
those made for the counts of items Omitted, Not Reached, and Not Attempted
were made for the two guessing indices. As in the results of the preceding
three counts, the curve of the W-O index is seen to follow a declining
pattern, a pattern which is expected on theoretical grounds, and also
expected in view of the observed negative correlation of W-O and test
scores. Second, as in the tabulations of O, NR, and NA, the level of
guessing is clearly greater for Rights-directed tests than for Formula-
directed tests, except for the test of Practical Business Judgment, where
the amount of guessing, as measured by the W-O index, is virtually the
same for the two modes of directions. Third, the curves of the two sets
of W-O means track each other very closely, especially, again, for the
test of Practical Business Judgment. Finally, the mean values of the index
approach each other as one moves up the scale of ability, rather than
diverge from each other, as was observed in the SAT phase of the study.

The Ziller index of guessing behaves very much as does the W-O index,
except that it tends to rise, rather than decline, with the score on the
corresponding operational section. But like the W-O index, it shows
generally higher mean values for Rights- than for Formula-directed areas,
except for the Practical Business Judgment test where the means are very
similar; and it shows generally the same fluctuations in the progression
of its means with categories of score on the operational test. Unlike the
W-O index, there is no clear tendency for the two sets of means either
to converge or diverge as a function of score on the operational test.

Effect of Directions and Scoring on Reliability and Parallelism

Because the GMAT study was performed as part of an operational test
administration, it was not possible to vary the directions for the
separately-timed parts as was done in the SAT-verbal experiment. However,
the data provided by the GMAT study do permit the comparison of parallel-
forms correlations between tests both of which were administered and
scored with the same (Formula) directions with parallel-forms correlations
between tests administered and scored with different directions. They
also permit the comparison of KR (20) reliabilities under the two conditions
of administration and scoring.* Finally, they permit the comparison of
true-score correlations between two parallel tests in evaluating the
question whether a change from one type of administration and scoring to
another might not cause an extensive shift in the nature of the ability
measured.

It is recalled that the administration that permitted the tabulation
of these data was a regular administration of the GMAT in October 1980,
at which time certain conditions had to be met in order to provide
reportable scores for the students sitting for the tests at that
administration. Clearly, the operational scores (the scores of record) had
to be earned under Formula directions. Second, the tests administered under
different directions were composed of items that were being pretested
for possible operational use. Such items cannot be expected to be of the
uniformly high quality that characterizes the items in the operational
forms of the GMAT. Third, one of the five experimental tests used in
the study, Sentence Correction, was not parallel to the corresponding
Usage test, as would be ideal in a study of this sort. These foregoing
considerations are relevant to the interpretation of the results shown
in Tables 28 and 29.

*For a discussion of certain logical considerations in interpreting
these reliability estimates, see pages 63 and 64.

Observed-score correlations of the five experimental tests with
the corresponding test material of the six operational tests are shown
in the first section of Table 29. (In this table, correlations for which
directions and scoring are consistent with each other are italicized.)
For each item type, correlations are shown for both Rights and Formula
directions and for both Rights and Formula scoring. In general, the four
correlations for a given combination of operational test and experimental
test are remarkably similar. On the other hand, there is considerable
variation in correlations among the sets of correlations for different
tests, indicating that the size of the correlations is more a function of
the particular test than of directions or scoring. Within each test,
however, with one minor exception (Practical Business Judgment, Section 3),
the correlations between the operational tests administered and scored by
Formula and the experimental tests administered and scored by Formula
(r(Fo,Fe)) are slightly higher than the correlations between operational
tests administered and scored by Formula and experimental tests
administered and scored by Rights (r(Fo,Re)). Because of the constraints on
the administrations, discussed above, it was not possible to observe the
comparison between tests



Table 29

Observed-Score Correlations, Reliabilities, and True-Score Correlations
between Experimental and Operational Test Sections

                                Directions          Observed-Score        Reliability Coefficients                     True-Score
                                for                 Correlations(a,b)                                                  Correlations(a,b)
                                Experimental
Operational Test                Tests         N     r(Fo,Re) r(Fo,Fe)     r(Ro,Ro)(c) r(Fo,Fo)(d) r(Re,Re)(c) r(Fe,Fe)(d)   r(Fo,Re) r(Fo,Fe)

Reading Comprehension           R            5658     .728     .726         .793        .766        .854        .836         .900     .907
                                F            5739     .733     .732         .796        .771        .868        .848         .896     .905

Problem Solving                 R            5501     .687     .709         .754        [?]         .719        .686         .940     .994
                                F            5594     .735     .740         .760        [?]         .736        .720         .988    1.006

Practical Business              R            5738     .593     .589         .660        .625        .748        .736         .868     .869
Judgment-Section 3              F            5408     .595     .592         .662        .631        .752        .737         .864     .868

Practical Business              R            5738     .543     .536         .619        .589        .748(e)     .736(e)      .818     .816
Judgment-Section 5              F            5408     .559     .553         .614        .586        .752(e)     .737(e)      .842     .840

Data Sufficiency                R            5590     .676     .716         .822        .812        .751        .710         .866     .944
                                F            5657     .695     .731         .829        .819        .768        .722         .877     .950

Usage                           R            5409     .670     .677         .782        .755        .782        .758         .873     .895
                                F            5486     .672     .677         .781        .750        .796        .763         .869     .894

[?] Entry not legible in the source copy.

(a) Operational (Formula) scores used in this table are unrounded. Also, if
    scores are found to be negative, they are used as negative.
(b) Correlations between variables for which the scoring is consistent with the
    directions appear in italics in the original.
(c) Kuder-Richardson Formula (20) reliability.
(d) Dressel (1940) adaptation of KR (20) reliability.
(e) Note that these reliabilities are identical to the corresponding
    reliabilities shown for Practical Business Judgment-Section 3. Although
    there were two operational sections of this type, Section 3 and Section 5,
    there was only one such experimental section.



that are both administered and scored by Rights (r(Ro,Re)) and tests that
are administered and scored in different ways, e.g., r(Ro,Fe) or r(Fo,Re).

The second section of Table 29 gives KR (20) reliabilities for
Rights and Formula scores on both the operational and experimental
sections of the test. In this section of the table, as in the first
section, the italicized numbers apply to reliabilities calculated under
scoring conditions consistent with the directions used in administering
the tests. Several comments may be made on these data. First, it is seen
that the reliabilities of four of the five experimental tests when
administered and scored by Rights are higher than the reliabilities of the
same tests when administered and scored by Formula. The differences,
however, are relatively small, the largest being .029. Second, Rights
scores yield higher reliability coefficients than do Formula scores in all
12 comparisons for the operational tests and in all 10 comparisons for the
experimental tests. In the 22 comparisons, the differences range from .008
to .046, with a median of .026. These findings suggest that Rights scoring
provides more reliable scores than does Formula scoring. On the other hand,
the results for the experimental tests indicate that Formula directions
yield higher reliability coefficients than do Rights directions, for both
methods of scoring. Thus, when directions are compared, Formula directions
yield higher reliabilities, and when scoring methods are compared, Rights
scoring yields higher reliabilities. On the whole, then, the internal
consistency reliability results do not provide an unequivocal basis for
preferring Rights-directions-and-scoring to Formula-directions-and-scoring.
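
For readers unfamiliar with the coefficient, the following is a minimal
sketch of the standard KR (20) computation for Rights (0/1) item scores.
It is not the program used in the study; the function name and the tiny
response matrix are hypothetical. The Dressel (1940) adaptation used for
Formula scores is not shown.

```python
def kr20(responses: list[list[int]]) -> float:
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    Each row is one examinee; each column is one item.
    Uses the population (divide-by-n) variance of total scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum of item variances p * (1 - p), p = proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        sum_pq += p * (1 - p)

    return n_items / (n_items - 1) * (1 - sum_pq / var_total)


# Hypothetical data: four examinees, three items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(data))  # 0.75
```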






The data also permit a comparison of the reliability coefficients
of the operational and experimental tests. The table below gives the
numbers of items and time limits for each operational and experimental
test.

              Numbers of Items and Time Limits for the
                 Operational and Experimental Tests

                                         Operational          Experimental
                                       No. of    Time       No. of    Time
Test Section                           Items    (Mins.)     Items    (Mins.)

Reading Comprehension                    25       30          29       30
Problem Solving                          30       40          25       30
Practical Business     Section 3         20       20          32       30
  Judgment             Section 5         20       20
Data Sufficiency                         30       30          40       30
Usage                                    25       15          30*      30

*The experimental section corresponding to the operational Usage section
consisted of Sentence Correction items.

Results for reliability coefficients of Formula scores shown in Table 29
indicate that in all eight comparisons for which time limits of the
experimental and operational tests were different (Problem Solving,
Practical Business Judgment, and Usage/Sentence Correction), the test
having the longer time limit was more reliable. In the four comparisons
based on operational and experimental tests having equal time limits, the
experimental test had the higher reliability for Reading Comprehension and
the operational test had the higher reliability for Data Sufficiency.

The third section of Table 29 gives estimates of true-score
correlations between each operational test and its corresponding
experimental test, administered and scored under each of the two modes,
Formula and Rights. The study design provided internal-consistency
estimates of reliability for the operational and experimental tests. To the
extent that the observed-score correlations involved sources of error not
present in the KR (20) and Dressel determinations of reliability, and to
the extent that these determinations underestimate the reliability, the
estimates of the true-score correlations are too high. With the exception
of the Problem Solving test, the estimated correlations of true scores are
noticeably lower than the value of 1.00 that would be expected for parallel
tests. This divergence from parallelism is probably attributable, in large
part, to the fact that pretests rather than operational tests were used in
the experiment. The fact that the SAT-verbal parts in the first phase of
this study yielded coefficients nearer to 1.00 would be consistent with
this interpretation because the SAT-verbal tests were operational forms. It
is plausible, also, that variation in results for different item types may
arise because it is more difficult to construct strictly parallel tests for
some item types than for others.
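
The true-score correlations appear to follow the usual correction for
attenuation, in which the observed correlation is divided by the square
root of the product of the two reliabilities. The sketch below assumes
that formula (the text does not state it explicitly) and checks it against
the Table 29 entries for Reading Comprehension under Formula directions
with Formula scoring: observed correlation .732, operational reliability
.771, experimental reliability .848. The function name is hypothetical.

```python
import math


def true_score_correlation(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Correction for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    return r_observed / math.sqrt(rel_x * rel_y)


# Reading Comprehension, Formula directions, Formula scoring (Table 29).
estimate = true_score_correlation(0.732, 0.771, 0.848)
print(round(estimate, 3))  # 0.905, the tabled true-score correlation
```

Because the reliabilities sit in the denominator, any underestimate of
reliability inflates the result, which is how an estimate such as the
1.006 reported for Problem Solving can exceed 1.00.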

Despite their limitations, the data provide some useful comparisons
of true-score correlations obtained when directions and scoring are the
same for the two tests with corresponding correlations obtained when
directions and scoring differ from one test to the other. For five of the
six tests, the same directions yield higher true-score correlations than do
different directions; for the remaining test, the correlations are
equal. There are, however, only two comparisons for which the difference
exceeds .022. As it happens, both of the comparisons that yield relatively
large differences involve quantitative tests. For Problem Solving, the
difference is .066, and for Data Sufficiency, it is .084. These results
suggest the possibility that quantitative tests given under different
directions should not be regarded as strictly parallel. The limited data
of this study, however, do not permit a firm conclusion on this point.

Effects of Directions on Score Equating: Method of Analysis

Two main approaches were used in determining the effect of differ- 
ences in directions on score equating. In the first approach, each of 
the five operational parts was equated to itself. In the second approach, 
each of the five experimental tests given under Rights directions was 
equated to the corresponding experimental test given under Formula 
directions. 

The first approach called for equating scores on an operational test 
to scores on the same operational test by the following three methods: 

1. Identity Method. When a test is equated to itself, the
ideal equating line has, by definition, a slope of 1 and an intercept of 0
and provides a standard with which results of other methods may be compared.

2. Invariant Link Method. In this method, each group takes
one of the two tests that are to be equated. In addition, both groups take
the same link test items, but under different guessing directions. One
group takes the link test under Rights directions and the other group takes
the link test under Formula directions. Equating is performed by rescoring
by Formula the link test taken under Rights directions and assuming that
such scores can be treated as interchangeable with Formula scores earned
under Formula directions. The analytical method used for treating these
data is then identical to that for Maximum Likelihood equating, described
below.

In this part of the study an operational test section was
equated to the same operational test section as though two different tests
were involved, using the data of the two spiralled groups that took the
same experimental section (but under different directions). The
experimental test given under Rights directions was rescored by Formula and
used as the link test, as described in the preceding paragraph.

3. Spiralling Method. This method calls for distributing
the tests in sequence, within each room in which the test is administered.
As a result of this process, the samples of students taking each form will
represent systematic samples of the total group tested. According to
probability theory, each subsample will tend to become increasingly similar
to the other subsamples as sample sizes increase. Thus, for large samples
it can be assumed that any two subsamples are approximately equal in the
abilities measured by the tests to be equated. Scores on two tests are
equated by setting equal the means and standard deviations of the samples
taking those two tests. The result of the equating is that transformed
scores on one test will have the same mean and standard deviation as the
observed scores on the other test. (For a fuller discussion of this method,
see Angoff (1971, pp. 569-571).)

Here, an operational test section was again equated, as it was by
the Invariant Link Method, to the same operational test section as though
two different tests were involved, using only the data of the two spiralled
groups of students, as described in the preceding paragraph, and without
the use of the experimental test scores as a link.

In this first approach, the primary interest was in comparing the
results obtained using the Invariant Link Method with those obtained by
the Identity and Spiralling methods.

Although it was not possible to express the results of these equatings
on the customary GMAT scale, it was decided to establish an arbitrary scale
for each part score, so defined that the mean converted score would be 500
and the standard deviation of converted scores would be 100 for the total
study sample. In this way, equating results would be expressed on a scale
similar to the GMAT Total score scale.
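
The two linear transformations described above, equating by setting means
and standard deviations equal and defining a scale with mean 500 and
standard deviation 100, can be sketched as follows. The function names and
score lists are hypothetical, and the use of the population standard
deviation is an assumption; the report does not say which divisor was used.

```python
from statistics import mean, pstdev


def linear_equating(x_scores: list[float], y_scores: list[float]) -> tuple[float, float]:
    """Slope and intercept that carry X scores onto the Y scale.

    After conversion, the X sample has the same mean and standard
    deviation as the Y sample.
    """
    slope = pstdev(y_scores) / pstdev(x_scores)
    intercept = mean(y_scores) - slope * mean(x_scores)
    return slope, intercept


def scale_parameters(raw_scores: list[float],
                     target_mean: float = 500.0,
                     target_sd: float = 100.0) -> tuple[float, float]:
    """Slope and intercept placing raw scores on a mean-500, SD-100 scale."""
    slope = target_sd / pstdev(raw_scores)
    intercept = target_mean - slope * mean(raw_scores)
    return slope, intercept


# Hypothetical samples from two spiralled groups.
form_x = [10, 12, 14, 16, 18]
form_y = [20, 25, 30, 35, 40]
slope, intercept = linear_equating(form_x, form_y)
print(round(slope, 4), round(intercept, 4))  # 2.5 -5.0
```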

The second main approach called for equating Rights scores on an 
experimental test administered under Rights directions to Formula scores 
on the same experimental test, administered under Formula directions. Two 
methods of equating were used:

1. Maximum Likelihood Method. This method calls for
administering each of the two tests to be equated to a random sample of a
suitable group of students and administering the same link, or anchor, test
to all members of both samples. In this study, the operational part
corresponding to each pair of experimental tests served as the link test.
The analytical procedure calls for the estimation of the mean and variance
of both tests for the total combined sample, and for setting equal the
estimated means and standard deviations for the two tests, as is done in
the Spiralling Method. The link test serves to increase the precision of
the equating results. This method is described fully by Angoff (1971,
pp. 576-579).

2. Invariance Method. In this method, the equating is based
on the results of a single test administered to a single group. A test 
given under Rights directions and scored Rights is also scored by Formula. 
It is then assumed that the Formula scores so obtained are equivalent to 
the Formula scores that would have been obtained had Formula directions 
as well as Formula scoring been employed for that group. The equating 
procedure then calls for the direct equating of Rights scores to Formula 
scores for the same individuals by setting equal their means and standard 
deviations on the two types of scores. This procedure was carried out for 
the experimental tests using the data, in each instance, for the group taking 
the test under Rights directions. 

This phase of the analysis made it possible to compare results obtained 
by the Invariance Method with those obtained by the Maximum Likelihood 
Method, which is a standard equating method. 
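
A minimal sketch of the Invariance Method as described above: each
examinee in the Rights-directions group is scored both ways, and the
Rights scores are converted to the Formula scale by matching means and
standard deviations. The function names and records are hypothetical; the
example assumes a 20-item, five-choice test on which every examinee
answers every item, so that W = 20 - R.

```python
from statistics import mean, pstdev


def formula_score(rights: float, wrongs: float, k: int) -> float:
    """Formula score R - W/(k-1) for k-choice items."""
    return rights - wrongs / (k - 1)


def invariance_equating(records: list[tuple[int, int]], k: int) -> tuple[float, float]:
    """Slope and intercept converting Rights scores to Formula scores.

    records holds (rights, wrongs) pairs for examinees tested under
    Rights directions; both score types come from the same answer sheets.
    """
    rights_scores = [r for r, _ in records]
    formula_scores = [formula_score(r, w, k) for r, w in records]
    slope = pstdev(formula_scores) / pstdev(rights_scores)
    intercept = mean(formula_scores) - slope * mean(rights_scores)
    return slope, intercept


# Hypothetical group: 20 items, no omits, five-choice items.
group = [(8, 12), (12, 8), (16, 4), (20, 0)]
slope, intercept = invariance_equating(group, k=5)
print(round(slope, 4), round(intercept, 4))  # 1.25 -5.0
```

With no omitted items the Formula score is an exact linear function of the
Rights score (here 1.25R - 5), so the equating line reproduces it. With
real data, omits break that exact relation and the line is only the
mean-and-sigma fit.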

In order to express the results of these equatings on a scale similar 
to the GMAT Total score scale, it was decided to equate Formula scores on 
each experimental test to Formula scores on the corresponding operational 
part, and to use these equations in conjunction with equations already 
developed relating Formula scores on each part to the arbitrary scale. The 
equating of experimental tests to the corresponding operational parts was
done by setting means and standard deviations equal for examinees who took
both tests. Algebraic solution of each pair of equations yielded equations
relating Formula scores on the experimental tests to the arbitrary scale
for each part. These results, when used along with the equations relating
Rights scores on the experimental tests to Formula scores on the
experimental tests, made it possible to write equations to convert Rights
scores on the experimental tests to converted scores in the units of the
arbitrary scales.

Effect of Directions on Score Equating: Findings 

Results obtained for equating each operational part to itself are
shown in Table 30. In this table, the Identity and Spiralling methods yield
results that do not involve the Invariance Hypothesis; the Invariant Link
Method, however, does involve this hypothesis.

If the Identity Method results are taken as the standard of comparison,
consideration of the slope values shows that the Invariant Link Method
agrees more closely with the Identity Method than does the Spiralling Method
in four of the five comparisons. For the 15 sets of results at selected
points on the raw score scale, the Invariant Link agrees more closely with
the Identity Method in seven comparisons, the Spiralling Method agrees more
closely in four comparisons, and there are four ties. There is a marked
similarity between the results of the Spiralling and Invariant Link
methods for selected points. For mean scores, only one difference
between the Spiralling and the Invariant Link results is as large as two
converted score points, and in the remaining ten comparisons, only one
difference is as large as three converted score points. It should be noted,







Table 30

Conversion Parameters Relating Formula Scores on Each Part of Operational GMAT
to Formula Scores on Same Part as Determined by Various Methods of Equating(a)

                                                          Parameters           Converted Score When
                                                                               Raw (Formula) Score is:
                              Number     Equating
Part of GMAT                  of Items   Method        Slope     Intercept     Chance   Mean(b)   Perfect

Reading Comprehension            25      Identity     18.6644    250.0109       250      500       717
Reading Comprehension            25      Spiral        [?]       250.2330       250      502       720
Reading Comprehension            25      Inv. Link    18.6588    251.5824       252      501       718

Problem Solving                  30      Identity     21.5332    311.4144       311      500       957
Problem Solving                  30      Spiral       21.8347    311.5242       312      503       967
Problem Solving                  30      Inv. Link    22.0155    309.8210       310      503       970

Practical Business Judgment      40      Identity     14.655[?]  222.2048       222      500       808
Practical Business Judgment      40      Spiral       14.6874    222.9918       223      501       810
Practical Business Judgment      40      Inv. Link    14.6669    221.2698       221      499       808

Data Sufficiency                 30      Identity     15.9482    323.0691       323      500       802
Data Sufficiency                 30      Spiral       16.2528    318.5462       319      499       806
Data Sufficiency                 30      Inv. Link    16.2241    317.5239       318      498       804

Usage                            25      Identity     18.4257    294.1554       294      500       755
Usage                            25      Spiral       18.1806    297.8571       298      501       752
Usage                            25      Inv. Link    18.2304    297.4223       297      501       753

[?] Entry or digit not legible in the source copy.

(a) Each part score was expressed on a scale defined to have a mean of 500 and
    a standard deviation of 100 for the total group (N = 55,780).

(b) Computed using mean score of total group for each part, as follows: Reading
    Comprehension, 13.3939; Problem Solving, 8.7579; Practical Business
    Judgment, 18.9554; Data Sufficiency, 11.0941; and Usage, [?].

however, that both the Spiralling and the Invariant Link methods differ 
substantially from the Identity Method for perfect scores on the Problem 
Solving test. The results of this analysis may be interpreted as favorable 
to the usefulness of the Invariant Link Method under the conditions of the 
study. 
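
The converted scores in Table 30 follow from the tabled slope and
intercept by a straight-line conversion. The sketch below reproduces the
Identity Method row for Reading Comprehension (slope 18.6644, intercept
250.0109, mean raw score 13.3939, 25 items). Treating "chance" as a raw
Formula score of zero is an assumption, although it is consistent with the
tabled chance values matching the rounded intercepts. The function name is
hypothetical.

```python
def converted_score(raw: float, slope: float, intercept: float) -> float:
    """Linear conversion from raw Formula score to the 500/100 scale."""
    return slope * raw + intercept


# Reading Comprehension, Identity Method (Table 30).
SLOPE, INTERCEPT = 18.6644, 250.0109

for label, raw in [("chance", 0.0), ("mean", 13.3939), ("perfect", 25.0)]:
    print(label, round(converted_score(raw, SLOPE, INTERCEPT)))
# chance 250
# mean 500
# perfect 717
```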

Results shown in Table 31 permit a comparison of the Invariance
Method with the Maximum Likelihood Method. The question of special interest
is whether there is evidence of systematic differences in results between
the two methods. With respect to slope parameters, the Maximum Likelihood
Method yields a larger value in two of the five comparisons, the Invariance
Method yields a larger value in two comparisons, and in the fifth comparison,
the slopes are equal. Results for the selected raw score levels show a
higher converted score for Maximum Likelihood in seven instances, a higher
converted score for Invariance in seven instances, and one tie. Among the
15 comparisons only one difference exceeds four converted score points. For
perfect scores on the Problem Solving test the Invariance Method yields a
value 11 points higher than the Maximum Likelihood Method. These results
are consistent with the other equating results in supporting the
hypothesis that Formula scores may be considered to be invariant with
respect to Rights and Formula directions.






                                       Table 31

       Conversion Parameters Relating Rights Scores on Each Experimental Test
                  to Formula Scores on the Same Experimental Test

                                                                        Converted Score When
                                                       Parameters       Raw (Rights) Score Is
                              Number   Equating
Experimental Test            of Items  Method        Slope  Intercept  Chance  Mean*  Perfect

Reading Comprehension           29     Max. Lik.   16.1317   257.9860     352    500  [illegible]
Reading Comprehension           29     Invariance  15.9962   260.9961     354    501  725

Problem Solving                 25     Max. Lik.   24.8160   259.6898     384    500  880
Problem Solving                 25     Invariance  25.3876   255.9344     383    501  891

Practical Business Judgment     32     Max. Lik.   20.6554   110.4282     243    499  771
Practical Business Judgment     32     Invariance  20.6043   109.0225     241    497  768

Data Sufficiency                40     Max. Lik.   18.7842   193.7417     344    501  945
Data Sufficiency                40     Invariance  18.8972   189.2450     340    499  945

Sentence Correction             30     Max. Lik.   19.1585   172.4707     287    499  747
Sentence Correction             30     Invariance  19.1585   173.2982     288    500  748

Notes (partly illegible in the source):
  [illegible] examinees who took both tests, [illegible]
* Computed using mean Rights score on experimental test for students who received Rights directions.
  Shown in Table 16.
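
The converted scores in Table 31 follow from a linear conversion, converted = slope x raw + intercept. The short Python sketch below is not part of the original report; it recomputes the Perfect column for a few rows transcribed from the table. The function names and the dictionary layout are illustrative assumptions, and rounding to the nearest integer is assumed.

```python
# Rows transcribed from Table 31:
# (number of items, slope, intercept, reported converted score at a perfect raw score).
ROWS = {
    ("Reading Comprehension", "Invariance"): (29, 15.9962, 260.9961, 725),
    ("Problem Solving", "Max. Lik."): (25, 24.8160, 259.6898, 880),
    ("Problem Solving", "Invariance"): (25, 25.3876, 255.9344, 891),
    ("Data Sufficiency", "Max. Lik."): (40, 18.7842, 193.7417, 945),
    ("Data Sufficiency", "Invariance"): (40, 18.8972, 189.2450, 945),
}


def convert(raw_rights, slope, intercept):
    """Linear conversion of a raw Rights score to the converted score scale."""
    return slope * raw_rights + intercept


def perfect_scores(rows):
    """Converted score at a perfect raw score (raw = number of items), rounded."""
    return {
        key: round(convert(n_items, slope, intercept))
        for key, (n_items, slope, intercept, _reported) in rows.items()
    }


if __name__ == "__main__":
    for key, value in perfect_scores(ROWS).items():
        print(key, "computed:", value, "reported:", ROWS[key][3])
```

Under these assumptions the recomputed values agree with the tabled ones, including the 11-point difference between the two methods at a perfect score on Problem Solving.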






Summary and Conclusions 

Unlike the data provided by the first phase of the study, in
which operational forms of the SAT-verbal and the Chemistry Achievement
Test were administered in a specially designed experiment, the GMAT
data were taken from a regularly scheduled administration of the test.
The entire group of about 55,000 examinees who took the operational
form of the GMAT in October 1980 was divided, essentially at random,
into 10 approximately equal subgroups and assigned to take, in addition
to the operational test, one of five available sections of pretest
items (Reading Comprehension, Problem Solving, Practical Business
Judgment, Data Sufficiency, and Sentence Correction) under either
Rights or Formula directions. Except for Sentence Correction, there
was an operational section representing the same item type. For
Sentence Correction, the corresponding operational section included
Usage rather than Sentence Correction items. The spiralled administration
of these sections made it possible to compare responses of
examinees made under the two types of directions and also to compare
the characteristics of both Rights scores and Formula scores for the
two types of directions.

The data provided by the GMAT phase of the study confirm the
conclusion, drawn from the SAT data, that the response strategies of
examinees are generally consistent with the instructions they are given
for guessing. As evidenced by counts of items Omitted, Not Reached,
and Guessed, examinees do attempt more items under Rights directions
than under Formula directions. The result of this differential
behavior is that they do, as expected, earn higher Rights scores under
Rights directions than under Formula directions. However, when their
answer sheets are rescored by Formula, it is found that the differences
for examinees taking the tests under the two directions are virtually
zero. This finding gives clear support to the Invariance Hypothesis,
which is that Formula-scoring compensates for differences in guessing
strategies caused by differences in directions. One interpretation
of this finding is that although some students may indeed improve
their scores by guessing on the basis of partial knowledge, other
students appear to diminish their scores because they guess on the
basis of misinformation. On the average, however, contrary to the
Differential Effects Hypothesis, the guesses of all students taken
together appear to be no better than chance. Also, as expected,
examinees at lower ability levels show larger numbers of Omitted and
Not Reached items than higher-ability examinees. However, contrary to
the results found in the SAT study, the difference between the effects
of the two sets of directions was smaller for high-ability students
than for low-ability students.

The data of this phase of the study point to higher KR(20)
reliabilities for Rights scoring, although there is a possibility that
individual predilections to leave items unanswered may tend to inflate
the reliability coefficients. This question regarding the interpretation
of the reliability data will also bear somewhat on the question of
the parallelism of Rights tests and Formula tests. In any case, it
appears that the two types of administration and scoring are not so
different as to cause doubt regarding parallelism, at least in the
case of the verbal subtests. Data for the quantitative subtests are
less clear; questions of parallelism may well need closer scrutiny for
quantitative types of items.

As would be anticipated from the results of the examination of
the two opposing hypotheses, Invariance vs Differential Effects, the
methods of equating that make use of the Invariance Hypothesis are in
excellent agreement with those that are taken as criterion methods.
These results are highly encouraging with respect to future attempts to
equate Rights-administered-and-scored tests to Formula-administered-and-
scored tests.






IMPLICATIONS OF FINDINGS 



The data provided by the studies of College Board SAT-verbal,
College Board Chemistry, and GMAT have helped considerably to clarify 
several of the issues relating to methods of administration and scoring 
of standardized tests. As described in the early pages of both parts 
of this report, the studies undertaken here were designed with several 
purposes in mind. Principal among these was the question whether 
Rights-administered-and-scored tests could be equated to Formula-
administered-and-scored tests without endangering the continuity of 
meaning of the scale. But in the process of considering various methods 
for carrying out such equating operationally and for developing other
equating methods and conversion equations as criteria for evaluating 
possible operational methods, it became clear that an assumption basic 
to these methods had to be satisfied. This was the assumption that on 
the whole, students respond by guessing, under Rights directions, to items 
that they would normally omit when confronted with the penalty for guessing 
imposed on them in Formula scoring, an assumption formally stated by Lord 
(1978). Therefore, granted this assumption, it was plain that although 
Rights scores earned under the two types of directions would be markedly 
different, Formula scoring would tend to obliterate these differences. 
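
The arithmetic behind this expectation can be sketched in a few lines of Python. The sketch is illustrative and not taken from the report: it assumes the conventional formula score R - W/(k - 1) for k-choice items, five-choice items, and purely random guessing on the items that would otherwise be omitted. The function names and the example examinee are hypothetical.

```python
from fractions import Fraction


def formula_score(rights, wrongs, n_choices=5):
    """Conventional formula score: rights minus wrongs / (choices - 1)."""
    return rights - Fraction(wrongs, n_choices - 1)


def expected_scores_after_guessing(rights, wrongs, omits, n_choices=5):
    """Expected (Rights, Formula) scores if every omitted item is answered
    by a blind guess among n_choices equally likely options."""
    p_correct = Fraction(1, n_choices)
    exp_rights = rights + omits * p_correct
    exp_wrongs = wrongs + omits * (1 - p_correct)
    exp_formula = exp_rights - exp_wrongs / (n_choices - 1)
    return exp_rights, exp_formula


# Hypothetical examinee under Formula directions: 20 rights, 4 wrongs, 6 omits.
print(formula_score(20, 4))                      # 19
print(expected_scores_after_guessing(20, 4, 6))  # (Fraction(106, 5), Fraction(19, 1))
```

For this hypothetical examinee, blind guessing on the 6 omitted items raises the expected Rights score from 20 to 21.2 while leaving the expected Formula score at 19, which is the pattern the Invariance Hypothesis predicts.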

The foregoing assumption, which is the basis of the Invariance
Hypothesis, was supported by the data in both studies reported here,
the College Board studies and the GMAT study. As expected, the
Rights scores of examinees tested under Rights directions were much
higher than the Rights scores of examinees tested under Formula
directions. But when the answer sheets for the two groups of examinees
were rescored by Formula, the differences between the groups virtually
disappeared. Moreover, studies of SAT-verbal examinees indicate that
this finding applies not only overall, but also separately at different
levels of ability. Thus, it is not true, as might have been expected,
that students at some levels of ability are more perceptive regarding
their assessments of their own knowledge than students at other levels of
ability. Apparently, students at all levels of ability are equally unable
to discern differences in their own levels of competence at the edges of
their competencies. Guessing at those edges appears to be as much
influenced by misinformation as by valid information.

The fact that the Invariance Hypothesis is supported, not only 
overall, but for examinees at all levels of ability, is of considerable 
importance for at least three reasons: One, it disconfirms the assertion 
made in the Differential Effects Hypothesis, which is that students are 
disadvantaged by Formula directions, that they would be better advised 
to guess, even in a Formula-scored test, since their scores would be 
higher, on the average, than if they did not guess. The fact is, however, 
that their scores would not be higher if they guessed than if they did 
not guess. Moreover, it seems to be assumed by the proponents of the 
Differential Effects Hypothesis that Rights directions equalize the 
advantage for all students, because Rights directions encourage students
to respond to all the items. However, as we have observed in this study,
the numbers of items Omitted and Not Reached, although smaller in Rights- 
directed than in Formula-directed tests, are still substantial; contrary 
to hypothesis, students do not all respond to all the items, in spite of 
the strong directions. Two, and most central to the particular purpose 
of this investigation, the evidence for the Invariance Hypothesis 
makes it possible to equate Rights scores to Formula scores without 
experiencing unacceptably large slippage in the scale, even under conditions 
of test disclosure, were our programs to change from Formula-scoring to 
Rights-scoring. As the studies of equating Rights to Formula indicate, 
the use of the Invariance Hypothesis makes the transition entirely feasible. 
Three, it is important to observe that the confirmation of the Invariance 
Hypothesis implies that since Formula scoring has the effect of compen- 
sating for, or equalizing, differences in behavior resulting from different 
directions for guessing, it also has the effect of compensating for dif- 
ferences in individual student strategies for guessing. Not only does 
this property of Formula-scoring have significance for easing the
transition from Formula-scored tests to Rights-scored tests, it also has a
more basic significance for the test administration itself. 

The tabulations of the nonresponse data confirm the findings made
in other analyses of the data: Nonresponse is a function of the directions
given in the administration, and also a function of ability level, but
not, at least as evidenced in these data, a function of ethnicity. There
are fewer items Omitted, Not Reached, and Not Attempted, and, correspondingly,
more items Guessed (as measured by either the W-O or the Ziller index)
for students tested under Rights directions than for students tested
under Formula directions. Also, there are fewer items Omitted, Not
Reached, and Not Attempted by more able than by less able students.
To what extent this finding is a function of ability in the abstract
sense and to what extent it is a function of the constraint imposed on
the scores by the number of items in the test is difficult to know.
Concerning whether abler students respond more appropriately to directions
for guessing than less able students, the data of the two studies yielded
inconsistent results.

The fact that examinees answered more items under Rights directions 
than under Formula directions is in accordance with expectations. However,
the expectation that every examinee would answer every item under Rights 
directions was by no means fulfilled, despite the fact that the instruc- 
tions stated explicitly that it would be to their advantage to do so. 
These results emphasize the importance of systematic efforts to encourage 
examinees to answer every question if Rights-directions-and-scoring are 
adopted for operational testing. Indeed, under operational conditions, 
a determined effort to minimize the number of unanswered items may be
considered to be an important step in maintaining uniform testing 
conditions for all examinees. 

Whether guessing is a function of ability level is difficult to
say. This, it appears, depends on the operational definition of guessing
one is willing to accept. As was pointed out earlier in this report,
the tendency to guess, when conceived of abstractly as a personality
trait, is probably uncorrelated with cognitive ability; on the other
hand, guessing behavior, certainly when derived from the test responses
themselves, would necessarily be correlated with test score. Whether
guessing behavior is better expressed as the index, W-O, or as a
proportion of noncorrect responses, as in the Ziller index, or in some index
other than either of these, is indeterminate. Yet it is basic to our
conclusions because in one respect, at least, the two indices lead to
different conclusions: the W-O index is negatively correlated with test
scores; the Ziller index is positively correlated with test scores, but
in general, the absolute size of the coefficients is smaller.

Although the parallel-forms reliabilities are virtually equal for
the two types of administration and scoring, the KR(20) reliability
coefficients are not: the reliabilities for Rights-administered-and- 
scored tests have a small, but consistent edge over the reliabilities for 
the Formula tests. Here too, however, the interpretation is not entirely 
clear. If there are consistent differences among individuals with respect 
to the tendency to guess, such differences will inevitably become con- 
founded with the scores themselves, but in such a way as to inflate the 
reliability coefficients, however they are calculated; and until guessing 
as a personality trait can be reliably measured and shown to correlate 
more with one type of administration than the other, this question too 
must remain indeterminate. 
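
For readers who wish to see how a KR(20) coefficient is obtained, the following Python sketch applies the standard Kuder-Richardson formula 20 to a small matrix of 0/1 (Rights-scored) item responses. The data are invented for illustration, the function name is hypothetical, and the use of the population variance of total scores is an assumed convention; the report's own computations may have differed in detail.

```python
def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (assumed convention).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    # Sum over items of p * q, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


# Invented data: five examinees, four items.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(example), 4))
```

Because the coefficient depends on the ratio of summed item variances to total-score variance, any consistent individual tendency that spreads total scores apart, such as a tendency to omit, can raise it, which is the interpretive caution raised above.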

The point has often been made that the issue of Rights vs Formula
is a trivial one because the two scores are so highly correlated; given
a set of answer sheets, the correlation between the two scores is usually
in excess of .98 or even .99. Quite aside from the appropriateness of
the conclusion of trivialness, the evidence in support of it is clearly
spurious inasmuch as both scores are based on the same set of test
responses and therefore must perforce be highly correlated. An appropriate
way to evaluate this question, it is submitted, is to assemble
data of the sort designed in the study of SAT-verbal, in which randomly
different groups take the same pair of tests under the two conditions of
administration. These data make it clear that in fact two tests administered
and scored in the same way, Rights or Formula, correlate more
highly than two tests administered and scored in different ways. However,
the differences amount to only about .02, on the average, when the
correlations are in the vicinity of .80.
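
The reason Rights and Formula scores computed from the same answer sheets must correlate so highly can be made explicit algebraically. Assume the conventional formula score for k-choice items, and write n for the number of items and R, W, and O for the numbers of rights, wrongs, and omitted or unreached items on one answer sheet (notation introduced here for illustration). The formula score is then a linear function of R and O:

```latex
\begin{align*}
FS &= R - \frac{W}{k-1}, \qquad W = n - R - O, \\
FS &= R - \frac{n - R - O}{k-1} = \frac{kR - n + O}{k-1}.
\end{align*}
```

When every examinee answers every item (O = 0), FS is an exact linear function of R and the two scores correlate perfectly. The correlation falls below unity only to the extent that O varies among examinees in ways not linearly predictable from R.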

Closely related to the foregoing question is the question of the parallelism
of the Rights-administered-and-scored mode vs the Formula-
administered-and-scored mode. The data from the SAT-verbal study are
clear on this point: Although it is true that true-score correlations
between tests that are administered and scored in the same mode are 
higher than true-score correlations between tests that are administered 
and scored in different modes, the differences are small. In any case, 
the true-score correlations between Rights and Formula tests are close 
enough to unity to dispel any concerns that the two types of administration 
and scoring are measuring different abilities. 
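
A true-score correlation of this kind is conventionally estimated by correcting the observed correlation for attenuation, that is, by dividing it by the square root of the product of the two reliabilities. The Python sketch below assumes that standard correction; this section of the report does not say how its estimates were obtained, and the function name and input values are hypothetical, chosen only to resemble the magnitudes discussed above.

```python
import math


def true_score_correlation(r_xy, r_xx, r_yy):
    """Observed correlation r_xy corrected for attenuation,
    given the reliabilities r_xx and r_yy of the two tests."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values: observed correlations near .80 that differ by .02,
# with both tests assumed to have reliability .81.
same_mode = true_score_correlation(0.80, 0.81, 0.81)
cross_mode = true_score_correlation(0.78, 0.81, 0.81)
print(round(same_mode, 3), round(cross_mode, 3))
```

With inputs of this size, both corrected values lie close to unity and differ only slightly, which is the pattern described in the text.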






The data from the GMAT study are less clear on this point. The 
results of the verbal tests are essentially in agreement with the SAT 
data, differing from the latter chiefly in the respect that some of 
the GMAT subtests may be less homogeneous and therefore less reliable 
in the KR(20) sense than the SAT subtests. The differences observed
in the case of the GMAT quantitative tests, however, are somewhat larger. 
The assumption of parallelism for such tests may not be fully warranted. 

The implications of the findings of the College Board and GMAT 
studies for the success of equating efforts in effecting a change from 
Formula-type tests to Rights-type tests are on the whole quite positive. 
The methods of equating that have been examined here for possible use in 
operational equating work have made use of the Invariance Hypothesis, and, 
as expected from the earlier confirmation of this hypothesis, these methods
yield results that are in good agreement with other, more nearly ideal 
procedures. Even if these results fall short of expectations, the data 
of the study have made it clear that students can and do shift their mode 
of response to test items in accordance with changes in directions for 
guessing, and moreover, appear to do so even in operational test admin- 
istrations, when they might be expected to perceive that a particular 
test section is experimental and will not count toward their score. 
Supported by evidence of this sort, and supported further by the results 
of these studies that show that differences in guessing strategies tend 
to be overcome and removed by Formula scoring, we may feel encouraged 
that still other methods of equating may be developed to supplement those
examined in this study to enlarge the range of possible solutions to the
problems of equating across a transition. 

Beyond the purposes for which this study was designed, and the
insights it has permitted into a set of issues that have so long been the
subject of controversy, some mention should be made of the value of the 
type of experimental design used in this study, one that is likely, if 
adopted by other investigators in the future, to clarify still other 
issues yet unresolved. As was pointed out in the early pages of this 
report, the present study of Rights and Formula scoring is the only one
to our knowledge that has been based on very large samples and designed 
in a symmetrical fashion, with an essentially random half of the examinees 
exposed to one type of directions for guessing and the other half exposed 
to another type of directions. This arrangement, supported, when possible, 
with additional, relevant test scores administered in the same way to 
everyone, as was the case in this study, and with background data (age,
sex, and ethnic membership, for example), would serve to enhance the
informational quality of future studies.

As a result of the random assignment of large groups of students to 
the two types of directions and the use of ability and background controls, 
these two sets of data (the SAT-verbal and Chemistry Achievement Test data,
and the GMAT data) have considerable value for other studies of the effects
of test administration and scoring. These could involve, for example, 
studies of speededness under the two conditions of administration (some 
of which have already been done), modifications of the W-O and the Ziller
indices of guessing, more detailed examination of the Invariance Hypothesis
in the GMAT administration as a function of ability level, studies of 
scoring accuracy, studies of parameter estimation for important applications 
of item response theory, studies of conventional methods of equating, and 
undoubtedly many others. Such studies could be undertaken and carried 
out to great advantage without the substantial costs of special administrations,
costs which often tend to inhibit the conduct of potentially useful studies.






References 

Abu-Sayf, F. K. The scoring of multiple-choice tests: A closer look.
Educational Technology, 1979, 19, 5-15.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike
(Ed.), Educational Measurement. Washington, D.C.: American Council
on Education, 1971.

Boldt, R. F. Study of linearity and homoscedasticity of test scores in
the chance score range. Educational and Psychological Measurement,
1968, 28, 47-60.

Cross, L. H. and Frary, R. B. An empirical test of Lord's theoretical
results regarding formula scoring of multiple-choice tests. Journal
of Educational Measurement, 1977, 14, 313-321.

Cureton, E. E. The correction for guessing. Journal of Experimental
Education, 1966, 34 (4), 44-47.

Davis, F. B. A note on the correction for chance success. Journal of
Experimental Education, 1967, 35 (3), 42-47.

Diamond, J. and Evans, W. The correction for guessing. Review of
Educational Research, 1973, 43, 181-191.

Dressel, P. L. Some remarks on the Kuder-Richardson reliability
coefficient. Psychometrika, 1940, 5, 305-310.

Ebel, R. Measuring educational achievement. Englewood Cliffs, N.J.:
Prentice-Hall, 1965.

Ebel, R. L. Blind guessing on objective achievement tests. Journal of
Educational Measurement, 1968, 5, 321-325.

Glass, G. V and Wiley, D. E. Formula scoring and test reliability.
Journal of Educational Measurement, 1964, 43-47.

Levine, R. and Lord, F. M. An index of the discriminating power of a
test at different parts of the score range. Educational and
Psychological Measurement, 1959, 19, 497-503.

Lord, F. M. Formula scoring and validity. Educational and Psychological
Measurement, 1963, 23, 663-672.

Lord, F. M. Relative efficiency of number-right and formula scores.
Research Bulletin 74-9. Princeton, N.J.: Educational Testing
Service, 1974.

Lord, F. M. Formula scoring and number-right scoring. Journal of
Educational Measurement, 1975, 12, 7-11.

Lord, F. M. Practical applications of item characteristic curve theory.
Journal of Educational Measurement, 1977, 14, 117-138.

Lord, F. M. Applications of item response theory to practical testing
problems. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1980.

Lord, F. M. and Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Rowley, G. L. and Traub, R. E. Formula scoring, number-right scoring,
and test-taking strategy. Journal of Educational Measurement, 1977,
14, 15-22.

Sherriffs, A. C. and Boomer, D. S. Who is penalized by the penalty for
guessing? Journal of Educational Psychology, 1954, 45, 81-90.

Slakter, M. J. Risk taking on objective examinations. American Educational
Research Journal, 1967, 4, 31-43.

Slakter, M. J. The penalty for not guessing. Journal of Educational
Measurement, 1968, 5, 141-144(a).

Slakter, M. J. The effect of guessing strategy on objective test scores.
Journal of Educational Measurement, 1968, 5, 217-226(b).

Slakter, M. J. Generality of risk taking on objective examinations.
Educational and Psychological Measurement, 1969, 29, 115-128.

Stanley, J. C. Psychological correction for chance. Journal of
Experimental Education, 1954, 22, 297-298.

Swineford, F. The measurement of a personality trait. Journal of
Educational Psychology, 1938, 29, 295-300.

Swineford, F. Analysis of a personality trait. Journal of Educational
Psychology, 1941, 32, 438-444.

Thorndike, R. L. (Ed.) Educational Measurement. Washington, D.C.: American
Council on Education, 1971.

Votaw, D. F. The effect of do-not-guess directions upon the validity of
true-false or multiple-choice tests. Journal of Educational Psychology,
1936, 27, 698-703.

Ziller, R. C. A measure of the gambling response-set in objective tests.
Psychometrika, 1957, 22, 289-292.



